ABSTRACT



Methods are discussed that can be used to investigate the 
fit of an item score pattern to a test model. Model-based tests and 
personality inventories are administered to more than 100 million people a 
year and, as a result, individual fit is of great concern. Item Response 
Theory (IRT) modeling and person-fit statistics that are formulated in the 
context of IRT take a prominent place in the literature. Person-fit 
statistics are extensively discussed in this paper. Also, methods formulated 
outside the IRT context and methods to investigate particular types of 
response behavior are discussed. The aim of this paper is to give the 
researcher an idea of the possibilities in this research area by emphasizing 
the similarities of most person-fit methods and by discussing the pros and 
cons of the methods. (Contains 98 references and a list of University of 
Twente research reports.) (Author/SLD) 
METHODOLOGY REVIEW: PERSON FIT

Index terms: answer copying, appropriateness measurement, item response theory, 
person-fit measurement, test theory. 
Inaccuracy of measurement is a central concern of measurement theorists, measurement
practitioners, and educational policy makers. Since the beginning of standardized testing,
inaccuracy of measurement has received widespread attention. Examples are reliability
theory and methods for estimating reliability (Gulliksen, 1950; Lord & Novick, 1968;
Spearman, 1910), statistics for comparing groups with respect to the probability of
correctly answering an item (differential item functioning; e.g., Millsap & Everson, 1993),
and differential prediction of subgroups of persons using moderated multiple regression
analysis (e.g., Aguinis & Stone-Romero, 1997). In this review, we give attention to
research methods for determining the fit of individual item score patterns to a test model.

Researchers have long shown interest in obtaining information beyond the
total score by studying patterns of individual item scores (e.g., Nunnally, 1978).
Discriminant analysis and cluster analysis have been used to cluster similar types of score
patterns, either for testing the hypothesis that a priori hypothesized groups can be
distinguished (discriminant analysis) or for discovering, in an exploratory sense, groups that
have similar response patterns (cluster analysis). Both methods concentrate on groups, not
on individual persons. In this review, methods are discussed that provide information at the
individual level.

In the past two decades, important contributions to assessing individual test performance
arose from item response theory (IRT); these contributions are summarized as
person-fit research and will be discussed extensively in this review. In most person-fit
research the fit of a score pattern to an IRT model is investigated. However, because
person-fit research (although not under that name) has also been conducted without IRT
modeling, approaches outside the IRT framework will also be discussed. Furthermore,
related research with respect to answer copying statistics will be discussed because there
are interesting relations between these statistics and person-fit measures.

The aim of this review is to present and discuss methods that can be used to detect
nonfitting item score patterns. As such, this review can be considered an extension of
Chapter 4 of the book of Hulin, Drasgow, and Parsons (1983) and of Kogut (1986), in which
overviews were given of such methods. Person-fit research has proven attractive
to many researchers, as is corroborated by a proliferation of research articles in this
area. A review of the present state of affairs thus seems to be justified. Moreover, this
review is much more comprehensive than a recent review by Meijer and Sijtsma (1995),
which provided a general discussion of person-fit research.

Person Fit or Appropriateness Measurement?

Methods that evaluate the fit of the individual test performance to an IRT model are usually
referred to as appropriateness measurement methods or person-fit methods. Levine and
Drasgow (1983, p. 110) seem to prefer the term appropriateness measurement for methods
that “recognize inappropriate test scores”. They furthermore state (Levine & Drasgow,
1983, p. 110) that “appropriateness measurement is only incidentally concerned with
questions of person fit, the goodness of fit of a person’s data to a test model (...)” and
also “because all tractable models are inaccurate (...) a model and a measure of model fit
cannot be considered useful for appropriateness measurement until they have been shown
to effectively classify appropriate and inappropriate test scores”. Although we fully agree
that all models are inaccurate for describing individual response behavior, we think that in
practice appropriateness measurement and person-fit measurement are one and the same
because most methods (and especially the methods that are used by Drasgow, Levine, and
colleagues) describe response behavior based on some type of test model. This implies
that the appropriateness of a test score is defined on the basis of the (non)fitting of an item
score pattern to a test model. Person fit is a more general term than appropriateness
measurement and we will use “person fit” to indicate statistical methods for evaluating
the fit of individual test performance to an IRT model or to other item score patterns in a
sample; the term appropriateness measurement is rather vague in this respect.

Rationale for Person-Fit Research 

As a measure of a person’s ability level, the total score (or the trait level estimate) may be
inadequate. For example, a person may guess some of the correct answers to multiple-choice
items, thus raising his/her total score on the test by luck but not by ability, or an
examinee not familiar with the test format may, due to this unfamiliarity, obtain a lower
score than expected on the basis of his/her ability level (e.g., Wright & Stone, 1979, pp.
165-190). Inaccurate measurement of the trait level may also be caused by sleeping
behavior (inaccurately answering the first questions in a test as a result of, for example,
problems of getting started), cheating behavior (copying the correct answers of another
examinee), and plodding behavior (working very slowly and methodically and, as a result,
generating item score patterns which are too good to be true given the stochastic nature
of a person’s response behavior as assumed by most IRT models; see e.g., Ellis & van den
Wollenberg, 1993; Holland, 1990).

It is important to realize that not all types of aberrant behavior will affect individual
test scores. For example, a person may guess the correct answers to some of the items
but also guess wrong on some of the other items, and, as a result of the stochastic nature
of this guessing process, it may not result in substantially different test scores under most
IRT models to be discussed below. Whether aberrant behavior will lead to nonfitting item
score patterns depends on numerous factors such as the type and the amount of aberrant
behavior.

Furthermore, it should be noted that although all methods discussed in this paper can 
be used to detect nonfitting item score patterns, several of these methods do not allow the 
mechanism that created the deviant item score patterns to be recovered. Other methods 
explicitly test against specific violations of a test model assumption or against particular 
types of deviant item score patterns. These methods may therefore facilitate the interpre- 
tation of nonfitting item score patterns. 

Person-fit Methods Based on Group Characteristics 



Statistics 

Most person-fit statistics compare an individual’s observed and expected item scores 
across the items from a test. The expected item scores are determined on the basis of 
an IRT model or on the basis of the observed item means in the sample. In this section, 
we deal with group-based statistics. In the next section we discuss IRT-based person-fit 
statistics. 

To demonstrate the similarity between several statistics, a general formula is used in
which a particular choice of the weights ($w$) defines a particular person-fit statistic. Let
$n$ persons take a test consisting of $k$ items and let $\pi_g$ denote the proportion-correct score
on item $g$ that can be estimated from the sample by $\hat{\pi}_g = n_g/n$, where $n_g$ is the number
of 1 scores in the sample. Furthermore, let the items be ordered and numbered according
to decreasing proportion-correct score (increasing item difficulty): $\pi_1 > \pi_2 > \ldots > \pi_k$,
and let the realization of a dichotomous (0,1) item score be denoted by $X_g = x_g$ ($g = 1, \ldots, k$).
Examinees are indexed $i$, with $i = 1, \ldots, n$. The number-correct score $X = r$ is
the unweighted sum of item scores, $\sum_{g=1}^{k} X_g = r$. The general formula for group-based
statistics is given by






$$G_i = \frac{\sum_{g=1}^{r} w_g - \sum_{g=1}^{k} X_g w_g}{\sum_{g=1}^{r} w_g - \sum_{g=k-r+1}^{k} w_g}. \tag{1}$$

To enhance interpretation of $G_i$, person-fit statistics are often normed against the range
of possible values of $G_i$ given the definition of $w_g$.
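
As an illustration of Equation (1), the following minimal Python sketch computes $G_i$ for a single item score pattern. It assumes that the items are already ordered from easiest to hardest; the function name `group_fit_G` and the example proportion-correct values are hypothetical and serve illustration only.

```python
import numpy as np

def group_fit_G(x, w):
    """Sketch of Equation (1) for one item score pattern.

    x : 0/1 item scores, items ordered from easiest to hardest (assumed).
    w : item weights in the same order (hypothetical input).
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    k = x.size
    r = int(x.sum())                      # number-correct score
    if r == 0 or r == k:
        return float("nan")               # numerator and denominator are both 0
    easiest = w[:r].sum()                 # sum of w_g over the r easiest items
    hardest = w[k - r:].sum()             # sum of w_g over the r hardest items
    return float((easiest - (x * w).sum()) / (easiest - hardest))

# Hypothetical proportion-correct scores, decreasing from item 1 to item 5.
pi = np.array([0.9, 0.8, 0.6, 0.4, 0.2])
g_guttman = group_fit_G([1, 1, 1, 0, 0], pi)    # Guttman pattern
g_reversed = group_fit_G([0, 0, 1, 1, 1], pi)   # reversed Guttman pattern
```

With the proportion-correct scores used as weights, the Guttman pattern yields 0 and the reversed Guttman pattern yields 1 in this sketch. Patterns with $r = 0$ or $r = k$ are returned as undefined because the denominator vanishes.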

Person-fit statistics that are based on group characteristics compare an individual’s
item score pattern with the item score patterns of the other persons in the sample. Most
person-fit statistics are a count of certain score patterns for item pairs and compare this
count with the expectation under the deterministic Guttman (1944, 1950) model. Let $\theta$
be the latent trait known from IRT, and let $\delta_g$ be the location parameter of item $g$, which is
measured on the $\theta$ scale. $P_g(\theta)$ is the conditional probability of giving a correct answer to
item $g$. The Guttman model is defined by

$$\theta < \delta_g \Leftrightarrow P_g(\theta) = 0;$$

and

$$\theta \geq \delta_g \Leftrightarrow P_g(\theta) = 1.$$

The Guttman model thus excludes a correct answer on a relatively difficult item $h$ and
an incorrect answer on an easier item $g$ by the same examinee: $X_h = 1$ and $X_g = 0$, for
all $g < h$. Such item score combinations (0,1) are called “errors” or “inversions”. Item
score patterns (1,0) are permitted, and are known as “Guttman patterns” or “conformal”
patterns. A coefficient that has received some attention is the modified caution index
($C^*$) proposed by Harnisch and Linn (1981). $C^*$ is a slight adaptation of the caution
index ($C_i$; Sato, 1975). $C^*$ can be obtained from Equation (1) by choosing $w_g = \pi_g$.
$C_i$ also is obtained by choosing $w_g = \pi_g$, and then multiplying $\sum_{g=k-r+1}^{k} w_g$ by $r$ and the
other terms by $k$. Both statistics weigh the item scores with the proportion-correct score
in the sample, normed against the Guttman model. For example, it can easily be checked
that $C^* = 0$ if an examinee with total score $X = r$ answers the $r$ easiest items correctly
and the $k - r$ most difficult items incorrectly; this means that an examinee’s item score
pattern is in agreement with the Guttman (1950) model. Also note that $C^* = 1$ if the item
score pattern equals the reversed Guttman pattern, thus indicating maximum aberrance. The
lower bound of $C_i$ also equals 0 when an item score pattern is in agreement with the
Guttman model. However, $C_i$ does not have a fixed upper bound and thus the values of
$C_i$ are more difficult to interpret than those of $C^*$.

Coefficients similar to $C^*$ have been discussed by Donlon and Fischer (1968), van der
Flier (1977), and Tatsuoka and Tatsuoka (1982, 1983). Donlon and Fischer (1968) proposed
to use the personal point-biserial correlation ($r_{pbis}$) as a person-fit statistic; $r_{pbis}$ is simply
the correlation across all items between an examinee’s binary item scores and the vector
containing the sample frequencies of the item scores. Furthermore, they proposed the
personal biserial correlation ($r_{bis}$), which is the personal point-biserial correlation under
the assumption of a continuous normally distributed variable underlying each of the item
responses. Van der Flier (1977) defined U1 as the number of Guttman errors normed
against the maximum number of Guttman errors given $X = r$; this maximum equals
$r(k - r)$.
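
The counting of Guttman errors and the norming used in U1 can be sketched as follows. This is a minimal illustration that assumes items are ordered from easiest to hardest; the function names `guttman_errors` and `u1` are hypothetical.

```python
import numpy as np

def guttman_errors(x):
    """Count (0,1) pairs: easier item g incorrect, harder item h correct (g < h)."""
    x = np.asarray(x, dtype=int)
    # For every item, the number of incorrect answers on strictly easier items.
    zeros_before = np.cumsum(1 - x) - (1 - x)
    # Each correct answer contributes one error per easier item answered incorrectly.
    return int((x * zeros_before).sum())

def u1(x):
    """Guttman errors normed against their maximum r(k - r) given X = r."""
    x = np.asarray(x, dtype=int)
    k, r = x.size, int(x.sum())
    if r == 0 or r == k:
        return float("nan")   # no item pairs can be in error; norming undefined
    return guttman_errors(x) / (r * (k - r))

errors_mixed = guttman_errors([1, 0, 1, 0, 1])   # pairs (2,3), (2,5), (4,5)
u1_mixed = u1([1, 0, 1, 0, 1])
```

For the pattern (1, 0, 1, 0, 1) the sketch counts three errors out of a maximum of $r(k - r) = 6$, so that U1 equals .5; a Guttman pattern gives 0 and a reversed Guttman pattern gives 1.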

Tatsuoka and Tatsuoka (1983) discussed the norm conformity index,

$$NCI_i = 1 - \frac{2 \sum_{g=1}^{k-1} \sum_{h=g+1}^{k} X_g (1 - X_h)}{r(k - r)}. \tag{2}$$

The numerator contains the number of Guttman conformal (1,0) pairs of item scores
multiplied by 2. In the case of a reversed Guttman item score vector, the number of
conformal (1,0) pairs equals 0 and, consequently, $NCI_i = 1$. In the case of a Guttman
item score vector, the number of (1,0) pairs of item scores is $r(k - r)$ and, consequently,
$NCI_i = -1$. Note that $NCI_i$ is perfectly related to U1: $NCI_i = 1 - 2U1$. Tatsuoka and
Tatsuoka (1983) also discussed the Individual Consistency Index ($ICI_i$), which is
equivalent to $NCI_i$, but is determined for subgroups of items that require the same cognitive
solution strategy. Thus, whereas $NCI_i$ evaluates the consistency of an item score pattern
with the other score patterns in a group, $ICI_i$ evaluates the consistency of an item score
pattern with an a priori defined item score pattern based on the application of a particular
cognitive skill.

Kane and Brennan (1980) mentioned the agreement index, the disagreement index,
and the dependability index that can be used as group-based person-fit statistics. The
agreement index is defined as

$$A_i = \sum_{g=1}^{k} X_g \pi_g. \tag{3}$$

Let $A_i(\max)$ be the maximum value of $A_i$ given the total score $r$. This maximum value
is obtained if, given $r$, the item score pattern is a Guttman pattern; that is,

$$A_i(\max) = \sum_{g=1}^{r} \pi_g.$$

The disagreement index is defined as

$$D_i = A_i(\max) - A_i, \tag{4}$$

and the dependability index is defined as

$$E_i = \frac{A_i}{A_i(\max)}. \tag{5}$$

Note that $D_i$ equals the numerator of $C^*$ (see Equation 1, taking $w_g = \pi_g$).
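
The three indices can be computed together, as in the following minimal sketch. It assumes that items are ordered from easiest to hardest and that the agreement index weighs each item score by the proportion-correct score of the item; the function name `kane_brennan_indices` and the example values are hypothetical.

```python
import numpy as np

def kane_brennan_indices(x, pi):
    """Return (A, D, E): agreement, disagreement, and dependability indices.

    x  : 0/1 item scores, items ordered from easiest to hardest (assumed).
    pi : proportion-correct scores in the same order.
    """
    x = np.asarray(x, dtype=float)
    pi = np.asarray(pi, dtype=float)
    r = int(x.sum())
    a = float((x * pi).sum())          # agreement index
    a_max = float(pi[:r].sum())        # value of A for the Guttman pattern with score r
    d = a_max - a                      # disagreement index
    e = a / a_max if a_max > 0 else float("nan")   # dependability index
    return a, d, e

pi = np.array([0.9, 0.8, 0.6, 0.4, 0.2])           # hypothetical values
a, d, e = kane_brennan_indices([1, 0, 1, 0, 1], pi)
```

In this example the agreement index is 1.7 against a maximum of 2.3, so the disagreement index is .6; for a Guttman pattern the disagreement index is 0 and the dependability index is 1.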

Sijtsma (1986; see also Sijtsma & Meijer, 1992) proposed a person-fit statistic
denoted $H_i^T$. For a fixed set of $k$ items, let $\beta_i$ denote the expected proportion of items to
which person $i$ gives the correct response across locally independent repeated
measurements. Let $\beta_{ij}$ denote the expected proportion of items to which both persons $i$ and $j$
respond correctly. Then $\sigma_{ij} = \beta_{ij} - \beta_i \beta_j$ is the covariance between the scores of persons
$i$ and $j$. Now label examinees so that $i < j$ implies $\beta_i \leq \beta_j$; the maximum covariance
between two examinees is then obtained when $\beta_{ij} = \beta_i$ and therefore $\sigma_{ij}^{\max} = \beta_i (1 - \beta_j)$.
For one examinee in relation to $n - 1$ examinees,

$$H_i^T = \frac{\sum_{j \neq i} \sigma_{ij}}{\sum_{j \neq i} \sigma_{ij}^{\max}}.$$

The maximum value of $H_i^T$ equals 1 when each of the covariances between the item
score patterns of examinees $i$ and $j$, for all $i$ and $j$ ($i \neq j$), attains its maximum value;
$H_i^T = 0$ when the average covariance (numerator) equals 0; and $H_i^T < 0$ if this average
covariance is negative. Note that $H_i^T$ is not normed against the Guttman pattern; Sijtsma
(1986) showed that $H_i^T = 1$ is not necessary to obtain the perfect item score pattern.
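
A sample version of this statistic can be sketched by replacing the expected proportions by the observed proportions in a persons-by-items data matrix. This substitution is an assumption of the sketch, as are the function name `ht_statistic` and the small example matrix.

```python
import numpy as np

def ht_statistic(X):
    """Sample analogue of H^T for every person in an n x k matrix of 0/1 scores."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    beta = X.mean(axis=1)                      # observed proportion correct per person
    beta_ij = X @ X.T / k                      # proportion of items both persons answer correctly
    cov = beta_ij - np.outer(beta, beta)       # covariance between persons i and j
    lo = np.minimum.outer(beta, beta)          # smaller of the two proportions
    hi = np.maximum.outer(beta, beta)          # larger of the two proportions
    cov_max = lo * (1.0 - hi)                  # maximum covariance given the marginals
    np.fill_diagonal(cov, 0.0)                 # exclude j = i from both sums
    np.fill_diagonal(cov_max, 0.0)
    denom = cov_max.sum(axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denom > 0, cov.sum(axis=1) / denom, np.nan)

# Hypothetical data in which all patterns are nested (a perfect Guttman scalogram).
X = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 1, 0]])
ht = ht_statistic(X)
```

In the nested example every covariance attains its maximum, so the sketch returns 1 for each person. Persons with all-0 or all-1 patterns make the maximum covariances vanish, and the value is then left undefined.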

A group-based statistic with a known theoretical sampling distribution is Van der
Flier’s (1980, 1982) U3 statistic, which can be obtained from Equation (1) by choosing

$$w_g = \ln\left(\frac{\pi_g}{1 - \pi_g}\right).$$

To correct for dependence on the total score, U3 was standardized given $X = r$, and this
standardized statistic is given by

$$ZU3 = \frac{U3 - E(U3)}{[\mathrm{Var}(U3)]^{1/2}}, \tag{6}$$

where $E(U3)$ and $\mathrm{Var}(U3)$ are the expectation and the variance of U3, respectively. Van
der Flier (1980, 1982) showed that for long tests ZU3 is asymptotically standard normally
distributed. To obtain ZU3, $E(U3)$ and $\mathrm{Var}(U3)$ are needed. Note that for given $X = r$
all terms in Equation (1) are constant, except for $\sum_{g=1}^{k} X_g w_g$. Van der Flier (1982) derived
expressions for $E(\sum_{g=1}^{k} X_g w_g)$ and $\mathrm{Var}(\sum_{g=1}^{k} X_g w_g)$.



Research Using Group-Based Statistics 




Several studies have used simulated data and empirical data for investigating the
usefulness of group-based statistics to detect aberrant item score patterns.

Harnisch and Linn (1981) used empirical data from a reading test and from a math
test to obtain the correlation between several statistics and also their correlation with the
total score. The statistics were $C_i$, $C^*$, $r_{pbis}$, $r_{bis}$, $A_i$, $D_i$, $E_i$, and $NCI_i$. Harnisch and
Linn (1981) found that for both tests, the correlations between almost all statistics were
between .65 and .90, except for $A_i$, which correlated approximately .40 with each of the other
statistics. Most statistics correlated approximately .5 with the total score on both tests,
except for $C^*$, which correlated .20 (lowest) with the total score, and $A_i$, which correlated
.99 with the total score. Furthermore, Harnisch and Linn (1981) compared the average
$C^*$ scores across students for groups of students from different schools. They found
significant between-school differences that were attributed to instructional and curriculum
differences.

Rudner (1983) used simulated data to compare $r_{pbis}$, $r_{bis}$, $NCI_i$, and $C_i$ with several
IRT-based person-fit statistics ($U$, $W$, and $l$, to be discussed below). High correlations
ranging from .61 to .99 were found between the four group-based statistics. Two
cases were distinguished in order to investigate the effectiveness of the statistics to de- 
tect aberrant item score patterns. In one case, for a minority of examinees several correct 
responses were randomly selected and then changed into incorrect responses, thus produc- 
ing spuriously low number-correct scores. In the second case, some incorrect responses 
were changed into correct responses producing high number-correct scores. To identify 
whether the person-fit statistics could identify the altered item score patterns, Rudner 
(1983) checked whether the spuriously high or low scores were correctly classified as 
aberrant by the statistics. In general, the conclusion was that the effectiveness of detect- 
ing aberrant item score patterns increased with the number of altered items. For example, 
with an 11% change of incorrect responses into correct responses $NCI_i$ produced a
detection rate of .10, and with a 33% change $NCI_i$ produced a detection rate of .20. Another
conclusion was that for tests consisting of 45 items $r_{bis}$ performed better than $NCI_i$ and
$C_i$, but for longer tests (80 items) an IRT-based statistic performed best.

Miller (1986) used $C_i$ aggregated to the school class level to identify classes having
a poor match between the content of a math test and instructional coverage. It was found
that differences in time spent on a particular subject matter for which the test was intended
resulted in different types of item score patterns and that in classes having a high $C_i$ other
topics were emphasized than in classes having a low $C_i$. Miller (1986) used the
within-class standard deviation to interpret the mean $C_i$.

Tatsuoka and Tatsuoka (1983) used $NCI_i$ to detect deviant item score patterns in
an arithmetic test. They compared two groups of examinees. One group consisted of
students who were far off the mastery level and who made many different kinds of errors.
The other group consisted of students who were close to the mastery level and only made
sophisticated errors. Because of these differences the item difficulties were different for
the two groups. It was found that examinees who made only sophisticated errors and who
were included in the group far away from the mastery stage were classified as aberrant,
but when these same examinees were included in the group which was close to mastery
they were classified as normal. This empirical example illustrated that $NCI_i$
obtains a relatively high value (indicating aberrance) if an examinee’s item scores deviate
from the item scores of a majority of examinees in the group. In the same study, $ICI_i$
was used to identify examinees with inconsistent item score patterns on items that require
similar cognitive skills.

Jaeger (1988) used $C^*$ to identify judges whose patterns of item judgment were
aberrant in a standard setting procedure (a procedure for establishing a decision rule for
assigning candidates to pass/fail conditions). $C^*$ ranged from .05 to .62 with a mean of .32,
and correlated .16 with the total score on a reading test and .44 with the total score on a
mathematics test. Excluding judges with extreme $C^*$ values had no effect on the
recommended test standard.

Van der Flier (1982) used simulated data to investigate the usefulness of ZU3. In his
first study, item score patterns were simulated on the basis of the item difficulties from
two different populations (denoted populations I and II). ZU3 scores were determined on
the basis of the $\pi_g$ values in population I or II, and item score patterns were allocated to
population I or II on the basis of their ZU3 scores and the significance probabilities in
their corresponding populations. The exact decision rule on the basis of which a pattern
was allocated to a population was unclear. Van der Flier (1982) found that approximately
70% of the patterns were allocated to the correct population and that the percentage of
correct allocations was not related to the total score.

Furthermore, the use of ZU3 was investigated in a cross-cultural setting. Kenyan and
Tanzanian examinees were compared on the basis of a verbal reasoning test in Kiswahili.
It was known that Kenyan examinees had less knowledge of Kiswahili than Tanzanian
examinees. Van der Flier (1982) hypothesized that for examinees with low ZU3 scores
(indicating aberrance), the test scores underestimated reasoning ability and that for groups
of examinees with equal test scores, a more deviant group would obtain better results on
a criterion variable. Van der Flier (1982) found that Kenyan people with high deviance
scores on the verbal reasoning tests had better examination results (criterion) than would
be expected on the basis of their verbal reasoning test scores (predictor). The additional
information provided by the person-fit scores in predicting examination results, however,
was rather modest.

Meijer (1994) used simulated data for comparing the detection rate of U1 and U3 and
found that the detection rates were comparable. Using simulated data, Meijer, Molenaar,
and Sijtsma (1994) investigated the influence of test length, the type of aberrant responses,
and the overall item discrimination on the detection rate of U3. They found that a priori
defined aberrant item score patterns were easier to detect with longer tests and higher
item discrimination. Moreover, the kind of aberrant behavior had a strong influence on
the detection rate of U3. For example, nonfitting item score patterns were simulated by
changing the 0 scores on the most difficult items into 1 scores (mimicking cheating) or by
assigning a 1 score with a probability of .25 to each item (mimicking guessing). Cheaters
were easier to detect than guessers.
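
The two manipulations described above can be mimicked in a few lines. The following is only a sketch under stated assumptions: items are ordered from easiest to hardest, the number of manipulated items (`n_hardest`) is a hypothetical parameter not reported here, and the function names are illustrative.

```python
import numpy as np

def mimic_cheating(x, n_hardest):
    """Set the scores on the n_hardest most difficult items to 1.

    Items are assumed to be ordered from easiest to hardest.
    """
    y = np.array(x, dtype=int)            # copy, so the input pattern is left intact
    y[len(y) - n_hardest:] = 1
    return y

def mimic_guessing(k, rng, p_correct=0.25):
    """Assign a 1 score to each of k items with probability p_correct."""
    return (rng.random(k) < p_correct).astype(int)

rng = np.random.default_rng(1)            # seeded for reproducibility
cheater = mimic_cheating([1, 1, 0, 0, 0], n_hardest=2)
guesser = mimic_guessing(40, rng)
```

A cheating pattern produced this way concentrates its Guttman errors on the hardest items, whereas a guessing pattern spreads correct answers over the whole test regardless of item difficulty.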

Meijer (1996) used simulated data for investigating the influence of the type and the
number of aberrant patterns in a calibration sample on the detection rate of ZU3. An
increase in the number of aberrant simulees resulted in biased estimates of the $\pi_g$s and in
a decrease in the detection rate of ZU3. Furthermore, the type of misfit and the test length
influenced the detection rate of ZU3. The use of an iterative procedure to re-estimate the
proportion-correct score after removing aberrant patterns from the data was investigated.
Item score patterns that were classified as aberrant were removed from the dataset and
the proportion-correct score was re-estimated until no clear improvement in the detection
rate was found. Results suggested that this method can be used to improve the detection
rate of ZU3 when aberrant examinees are present in a dataset.

Evaluation of Group-Based Statistics

Group-based statistics classify a score pattern as aberrant when it is different from the
other score patterns in a group. With the exception of ZU3, researchers chose the critical
values for classifying a score pattern as aberrant by means of rules of thumb, based on
the characteristics of the data. For example, Harnisch (1983) suggested that for $C_i$ a
value higher than .6 indicated aberrance. Harnisch and Linn (1983) labelled item score
patterns with $C^* > .3$ as aberrant. These critical values, however, were based on one or
two empirical data sets.

Criteria for the selection of useful statistics that were used by Harnisch and Linn
(1981) and Rudner (1983) were (1) low correlation with the total score, and (2) detection
rate. Harnisch and Linn (1981) concluded that of the statistics considered in their study,
$C^*$ was least related to the total score and was the most suitable statistic to detect aberrant
item score patterns. On the basis of the literature, a full comparison of the correlations
between person-fit statistics and the total score, and of the rates of detection seems hardly
possible. The studies are incomplete, and the characteristics of the datasets are not always
clear.

The group-based statistics may be sensitive to nonfitting response behavior, but one
drawback is that their null distributions are unknown (with the exception of ZU3) and, as
a result, it cannot be decided on the basis of significance probabilities when a score pattern
is unlikely given a nominal Type I error rate. In general, let $t$ be the observed value of
a person-fit statistic $T$. Then, the significance probability or probability of exceedance
is defined as the probability under the sampling distribution that the value of the test
statistic is smaller than or equal to the observed value, $p^* = P(T \leq t)$, or larger than or
equal to the observed value, $p^* = P(T \geq t)$, depending on whether low or high values of the
statistic indicate aberrant response behavior. Although it may be argued that this is not a
serious problem as long as one is only interested in the use of a person-fit statistic as a
descriptive measure, a more serious problem is that the distribution of the numerical values
of most group-based statistics is dependent on the total score (e.g., Drasgow, Levine, &
McLaughlin, 1987). This dependence implies that when one critical value is used across
total scores, the probability of classifying a score pattern as aberrant is a function of the
total score, which obviously is undesirable.
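
When the sampling distribution of $T$ is not available in closed form, the significance probability may be approximated from values of $T$ simulated under the null model. The sketch below rests on that assumption, which is not a procedure taken from the studies reviewed here; the function name `significance_probability` and the example values are hypothetical.

```python
import numpy as np

def significance_probability(t_obs, null_values, low_is_aberrant=True):
    """Empirical p* from statistic values generated under the null model.

    low_is_aberrant=True  -> p* = P(T <= t)
    low_is_aberrant=False -> p* = P(T >= t)
    """
    null_values = np.asarray(null_values, dtype=float)
    if low_is_aberrant:
        return float(np.mean(null_values <= t_obs))
    return float(np.mean(null_values >= t_obs))

null_values = [-2.0, -1.0, 0.0, 1.0, 2.0]      # hypothetical simulated values of T
p_low = significance_probability(-1.0, null_values, low_is_aberrant=True)
p_high = significance_probability(-1.0, null_values, low_is_aberrant=False)
```

Because the distribution of most group-based statistics depends on the total score, such null values would have to be generated separately for each total score $r$ if one critical value is not to favor particular score levels.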

To summarize, it can be concluded that the use of group-based statistics has been
explorative, and with the increasing interest in IRT modeling, person fit has increasingly
been investigated within the IRT context.



Statistics 

Prerequisites 

In IRT the probability of obtaining a correct answer on item $g$ ($g = 1, \ldots, k$) is a function of
the latent trait value ($\theta$) and characteristics of the item such as the location $\delta$ (Hambleton
& Swaminathan, 1985; van der Linden & Hambleton, 1997). This conditional probability
$P_g(\theta)$ is the item response function (IRF). Further, we define the vector with item score
random variables $\mathbf{X} = (X_1, \ldots, X_k)$ and a realization $\mathbf{x} = (x_1, \ldots, x_k)$. IRT often assumes
that the item scores are locally independent:

$$P(\mathbf{X} = \mathbf{x} \mid \theta) = \prod_{g=1}^{k} P_g(\theta)^{x_g} [1 - P_g(\theta)]^{1 - x_g}. \tag{7}$$

For any cumulative probability distribution of $\theta$, $F(\theta)$, $\theta$ can be integrated out, which
yields

$$P(\mathbf{X} = \mathbf{x}) = \int \prod_{g=1}^{k} P_g(\theta)^{x_g} [1 - P_g(\theta)]^{1 - x_g} \, dF(\theta). \tag{8}$$



In order to have testable restrictions on the distribution of $\mathbf{X}$, specific choices for the
$P_g(\theta)$s, for $F(\theta)$, or for both have to be made. Whereas $F(\theta)$ sometimes is chosen to be
normal, $P_g(\theta)$ often is specified using the 1-, 2-, or 3-parameter logistic model (1-, 2-,
3-PLM). The 3-PLM is defined as

$$P_g(\theta) = \gamma_g + \frac{(1 - \gamma_g) \exp[\alpha_g(\theta - \delta_g)]}{1 + \exp[\alpha_g(\theta - \delta_g)]}, \tag{9}$$

where $\gamma_g$ is the lower asymptote [$\gamma_g$ is the probability of a 1 score for low-ability
examinees (that is, $\theta \to -\infty$)]; $\alpha_g$ is the slope parameter (or item discrimination parameter);
and $\delta_g$ is the item location parameter. The 2-PLM can be obtained by fixing $\gamma_g = 0$ for all
items; and the 1-PLM or Rasch model can be obtained by additionally fixing $\alpha_g = 1$ for
all items.
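
The item response function of the 3-PLM and the probability of an item score pattern under local independence can be sketched as follows. The function names `irf_3pl` and `pattern_probability` and all parameter values are hypothetical; the sketch treats $\theta$ and the item parameters as known.

```python
import numpy as np

def irf_3pl(theta, alpha, delta, gamma):
    """P_g(theta) under the 3-PLM; gamma = 0 gives the 2-PLM, alpha = 1 as well the 1-PLM."""
    alpha = np.asarray(alpha, dtype=float)
    delta = np.asarray(delta, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    z = alpha * (theta - delta)
    # exp(z) / (1 + exp(z)) written in the equivalent form 1 / (1 + exp(-z))
    return gamma + (1.0 - gamma) / (1.0 + np.exp(-z))

def pattern_probability(x, theta, alpha, delta, gamma):
    """P(X = x | theta): product over items, assuming local independence."""
    x = np.asarray(x, dtype=float)
    p = irf_3pl(theta, alpha, delta, gamma)
    return float(np.prod(p ** x * (1.0 - p) ** (1.0 - x)))

# Hypothetical two-item Rasch example at theta = 0 with both locations at 0.
prob = pattern_probability([1, 0], 0.0, alpha=[1, 1], delta=[0, 0], gamma=[0, 0])
```

In the two-item example each item has probability .5 of a correct answer at $\theta = 0$, so every item score pattern has probability .25. For very low $\theta$ the function approaches the lower asymptote $\gamma_g$.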

A major advantage of IRT models is that the goodness-of-fit of a model to empirical 
data can be investigated. Compared to group-based person-fit statistics, this provides the 
opportunity of evaluating the fit of item score patterns to an IRT model. To investigate the 
goodness-of-fit of item score patterns, several IRT-based person-fit statistics have been 
proposed. 

Let $w_g(\theta)$ and $w_0(\theta)$ be suitable functions. Following Snijders (1998), a general
form in which most person-fit statistics can be expressed is

$$\sum_{g=1}^{k} X_g w_g(\theta) - w_0(\theta). \tag{10}$$

To have a person-fit statistic with expectation 0, many person-fit statistics are expressed
in the centered form

$$V = \sum_{g=1}^{k} [X_g - P_g(\theta)] w_g(\theta). \tag{11}$$

Note that, as a result of binary scoring, $X_g^2 = X_g$; thus, for a suitable function $v_g(\theta)$,
statistics of the form

$$V = \sum_{g=1}^{k} [X_g - P_g(\theta)]^2 v_g(\theta)$$

can be re-expressed as statistics of the form in Equation (10).
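
The re-expression can be verified by expanding the square and substituting $X_g^2 = X_g$. The following worked step is a sketch added for illustration; the identification of $w_g(\theta)$ and $w_0(\theta)$ follows from matching terms with Equation (10).

```latex
\begin{align*}
\sum_{g=1}^{k} [X_g - P_g(\theta)]^2 v_g(\theta)
  &= \sum_{g=1}^{k} \left[ X_g^2 - 2 X_g P_g(\theta) + P_g(\theta)^2 \right] v_g(\theta) \\
  &= \sum_{g=1}^{k} X_g \, [1 - 2 P_g(\theta)] \, v_g(\theta)
     + \sum_{g=1}^{k} P_g(\theta)^2 \, v_g(\theta),
\end{align*}
```

so that $w_g(\theta) = [1 - 2P_g(\theta)] v_g(\theta)$ and $w_0(\theta) = -\sum_{g=1}^{k} P_g(\theta)^2 v_g(\theta)$.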

Residual-Based Statistics 

Wright and Stone (1979) and Wright and Masters (1982) proposed two mean-squared
residual-based statistics, $U$ and $W$. $U$ is based on squared standardized residuals. The
weight

$$v_g(\theta) = \frac{1}{k P_g(\theta)[1 - P_g(\theta)]}$$

results in

$$U = \sum_{g=1}^{k} \frac{[X_g - P_g(\theta)]^2}{k P_g(\theta)[1 - P_g(\theta)]}. \tag{12}$$

Note that the denominator contains the conditional variances of the individual item scores:
$\mathrm{Var}(X_g \mid \theta) = P_g(\theta)[1 - P_g(\theta)]$. $U$ can be interpreted as the mean of the squared
standardized residuals based on $k$ items. $W$ is defined as

$$W = \frac{\sum_{g=1}^{k} [X_g - P_g(\theta)]^2}{\sum_{g=1}^{k} P_g(\theta)[1 - P_g(\theta)]}. \tag{13}$$
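
Both statistics can be computed directly from the item scores and the model probabilities, as in the following minimal sketch. It assumes that the probabilities $P_g(\theta)$ are given; the function name `residual_fit_U_W` and the example values are hypothetical.

```python
import numpy as np

def residual_fit_U_W(x, p):
    """Return (U, W) for 0/1 item scores x and model probabilities p = P_g(theta)."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    var = p * (1.0 - p)                   # conditional variance of each item score
    sq_res = (x - p) ** 2                 # squared residuals
    U = float(np.mean(sq_res / var))      # mean of squared standardized residuals
    W = float(sq_res.sum() / var.sum())   # squared residuals pooled over pooled variance
    return U, W

# Hypothetical example: an unexpected error on a very easy item (P = .9).
U, W = residual_fit_U_W([0, 1], [0.9, 0.5])
```

In this example the single unexpected response on the easy item contributes a standardized squared residual of 9, giving $U = 5$, whereas $W$ is approximately 3.12 because the pooled denominator gives less weight to items far from the examinee’s level.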



The difference between $U$ and $W$ is that, as Wright and Stone (1979) assumed, $W$
is less sensitive to one unexpected response to an item with a difficulty far
away from the ability of an examinee. Wright and Stone (1979) and Wright and Masters
(1982) claimed that the transformation of $U$,

$$ZU = [\ln U + U - 1](df/8)^{1/2}, \tag{14}$$

with $df = k - 1$, and the transformation of $W$,

$$ZW = 3(W^{1/3} - 1)/q + (q/3), \tag{15}$$

where $q^2$ is the variance of $W$, are asymptotically standard normally distributed. The
appropriateness of these transformations for approximating the normal distribution can be
questioned, however, as will be discussed below.

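Two related statistics were proposed by Smith (1985). Let a test be divided into $S$
non-overlapping subsets of items, denoted $A_s$ ($s = 1, \ldots, S$); then the unweighted
between-sets fit statistic is defined as

$$UB = \frac{1}{S - 1} \sum_{s=1}^{S} \frac{\left\{ \sum_{g \in A_s} [X_g - P_g(\theta)] \right\}^2}{\sum_{g \in A_s} P_g(\theta)[1 - P_g(\theta)]}. \tag{16}$$

Let $m_s$ denote the number of items in subset $A_s$; then the unweighted within-sets fit
statistic is defined as

$$UW = \frac{1}{m_s} \sum_{g \in A_s} \frac{[X_g - P_g(\theta)]^2}{P_g(\theta)[1 - P_g(\theta)]}. \tag{17}$$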



Smith (1985,1986) used critical values obtained from a simulation study for classi- 
fying examinees as normal or aberrant. For the Rasch model, Kogut (1988) showed that 
(1) the joint distribution of subtest residuals 

Z,eA.lX9 - PM 

^A.p 9 (m-p 9 m 



is asymptotically multivariate normal, and (2) the distribution of UB is asymptotically 
chi-square distributed with S degrees of freedom when 9 is used and S - 1 degrees of 
freedom when the maximum likelihood estimate 6 is used. To investigate whether the 
asymptotic distributions hold reasonably well for tests of realistic length, empirical dis- 
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tributions were simulated for tests consisting of 40 items. Kogut (1988) concluded that 
the empirical distributions were accurate enough to approximate the asymptotic distribu- 
tions. Interesting is that both UW and UB can be used as diagnostic tools investigating 
whether a priori specified subsets of items fit the IRT model (UW) or for testing the null 
hypothesis that a examinee’s ability is the same across subgroups (UB). 
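
The between-sets and within-sets statistics can be sketched as follows. The code is a hedged 
illustration, not Smith's implementation: the function name and the toy data are hypothetical, 
and the normalizing constant 1/(S - 1) for UB is an assumption about the definition. 

```python
def between_within_fit(x, p, subsets):
    """Return (UB, [UW_1, ..., UW_S]) for item-index subsets A_1, ..., A_S."""
    n_subsets = len(subsets)
    between_terms = []
    within = []
    for items in subsets:
        residual = sum(x[g] - p[g] for g in items)
        variance = sum(p[g] * (1.0 - p[g]) for g in items)
        # Squared summed residual of the subset, scaled by its variance.
        between_terms.append(residual ** 2 / variance)
        # Mean squared standardized residual inside the subset.
        within.append(
            sum((x[g] - p[g]) ** 2 / (p[g] * (1.0 - p[g])) for g in items) / len(items)
        )
    ub = sum(between_terms) / (n_subsets - 1)  # assumed normalization
    return ub, within


p = [0.5] * 8
subsets = [[0, 1, 2, 3], [4, 5, 6, 7]]
print(between_within_fit([1, 1, 1, 1, 0, 0, 0, 0], p, subsets))  # scores differ by subset
print(between_within_fit([1, 0, 1, 0, 1, 0, 1, 0], p, subsets))  # scores balanced
```

With equal probabilities on all items, a pattern that is all correct on one subset and all 
incorrect on the other yields a large UB, whereas a balanced pattern yields UB = 0; the 
within-sets values are the same in both cases. 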

Likelihood-Based Statistics 

Most studies, to be discussed below, have been conducted using some suitable function 
of the log-likelihood function 

$$l = \sum_{g=1}^{k} \left\{X_g \ln P_g(\theta) + (1 - X_g) \ln[1 - P_g(\theta)]\right\}. \tag{18}$$

This statistic, first proposed by Levine and Rubin (1979), was further developed and 
applied in a series of articles by Drasgow, Levine and colleagues (e.g., Drasgow, Levine, 
& Williams, 1985; Drasgow, Levine, & McLaughlin, 1991; Levine & Drasgow, 1982; 
Levine & Drasgow, 1983). Two problems exist when using $l$ as a fit statistic. The first 
problem is that $l$ is not standardized, implying that the classification of an item score 
pattern as normal or aberrant depends on $\theta$. The second problem is that classifying 
an item score pattern as aberrant requires the distribution of the statistic under the null 
hypothesis of fitting response behavior (the null distribution), and for $l$ this distribution 
is unknown. Solutions proposed for these two problems are the following. 

To overcome the problem of dependence on trait level and the problem of unknown 
sampling distribution, Drasgow et al. (1985) proposed a standardized version $l_z$ of $l$ which 
was less confounded with $\theta$ and which was purported to be asymptotically standard 
normally distributed; $l_z$ was defined as 

$$l_z = \frac{l - E(l)}{[Var(l)]^{1/2}}, \tag{19}$$

where $E(l)$ and $Var(l)$ denote the expectation and the variance of $l$, respectively. These 
quantities are given by 

$$E(l) = \sum_{g=1}^{k} \left\{P_g(\theta) \ln P_g(\theta) + [1 - P_g(\theta)] \ln[1 - P_g(\theta)]\right\}, \tag{20}$$

and 

$$Var(l) = \sum_{g=1}^{k} P_g(\theta)[1 - P_g(\theta)] \left[\ln \frac{P_g(\theta)}{1 - P_g(\theta)}\right]^2. \tag{21}$$
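
A minimal sketch of the standardized log-likelihood statistic is given below. The function 
name, the use of Rasch probabilities, and the toy item locations are assumptions made for 
illustration; the formulas for the log-likelihood, its expectation, and its variance are the 
ones stated in the text, evaluated at a known trait value. 

```python
import math


def lz_statistic(x, p):
    """Return (l, E(l), Var(l), l_z) for binary scores x and probabilities p."""
    l = sum(xg * math.log(pg) + (1 - xg) * math.log(1.0 - pg) for xg, pg in zip(x, p))
    expectation = sum(pg * math.log(pg) + (1.0 - pg) * math.log(1.0 - pg) for pg in p)
    variance = sum(pg * (1.0 - pg) * math.log(pg / (1.0 - pg)) ** 2 for pg in p)
    # Var(l) is zero when every P_g equals .5, so l_z is undefined in that case.
    return l, expectation, variance, (l - expectation) / math.sqrt(variance)


# Rasch probabilities at theta = 0 for five hypothetical item locations.
deltas = [-2.0, -1.0, 0.0, 1.0, 2.0]
p = [1.0 / (1.0 + math.exp(-(0.0 - d))) for d in deltas]

guttman = [1, 1, 1, 0, 0]    # easy items correct, hard items incorrect
reversed_ = [0, 0, 1, 1, 1]  # easy items incorrect, hard items correct
print(lz_statistic(guttman, p)[3], lz_statistic(reversed_, p)[3])
```

Large negative values indicate misfit: the reversed pattern has a strongly negative value, 
whereas the Guttman pattern has a positive value. 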



Molenaar and Hoijtink (1990; 1996) argued that $l_z$ is only standard normally distributed 
when the true $\theta$ values are used, but in practice $\theta$ is replaced by the maximum likelihood 
estimate $\hat{\theta}$. Using an estimate and not the true $\theta$ will have an effect on the distribution 
of a person-fit statistic, as was shown by Molenaar and Hoijtink (1990), Nering (1995, 1997), 
and Reise (1995). These studies showed that when maximum likelihood estimates $\hat{\theta}$ were 
used, the variance of $l_z$ was smaller than expected under the standard normal distribution 
using the true $\theta$, particularly for tests up to moderate length (say, 50 items or less). As a 
result, the empirical Type I error was smaller than the nominal Type I error. 

For the Rasch model, Molenaar and Hoijtink (1990, p. 96) showed that $l$ can be 
written as the sum of two terms. Given $\sum_{g=1}^{k} X_g = r$ (that is, given $\hat{\theta}$, which in the 
Rasch model only depends on the sufficient statistic $r$), one term is independent of the 
item score pattern and the other is dependent on it. Following their notation, the former 
part is denoted by $d_0$ and the latter by $M$, such that 

$$l = d_0 + M$$

with 

$$d_0 = -\sum_{g=1}^{k} \ln[1 + \exp(\theta - \delta_g)] + r\theta$$

and 

$$M = -\sum_{g=1}^{k} \delta_g X_g. \tag{22}$$

Note that if the distribution of $l$ conditional on $\sum_{g=1}^{k} X_g = r$ is considered, $d_0$ is 
independent of $\mathbf{X}$, and $l$ and $M$ have the same ordering in $\mathbf{X}$. Because of its simplicity, 
Molenaar and Hoijtink (1990) used $M$ rather than $l$ as a person-fit statistic. Molenaar 
and Hoijtink (1990) proposed three approximations to the distribution of $M$: using (1) 
complete enumeration, (2) Monte Carlo simulation, and (3) a $\chi^2$-distribution, where the 
mean, standard deviation, and skewness of $M$ are taken into account; see Molenaar and 
Hoijtink (1990) for the conditions when to use either one of these approaches. In line with 
this research, Liou and Chang (1992) proposed a network algorithm that enumerated all 
possible response patterns to construct exact tail probabilities for $l$, and Bedrick (1997) 
derived alternative methods to approximate the first two moments of $M$. 
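
For short tests, the complete-enumeration approach can be sketched as follows. This is an 
illustrative sketch rather than the procedure of Molenaar and Hoijtink: the function names and 
the item locations are hypothetical. It relies on the Rasch-model property that, given the 
number-correct score $r$, the probability of a pattern is proportional to 
$\exp(-\sum_g \delta_g X_g) = \exp(M)$, so that no trait value is needed. 

```python
import itertools
import math


def m_statistic(x, delta):
    """M = -sum_g delta_g * X_g under the Rasch model."""
    return -sum(d * xg for d, xg in zip(delta, x))


def exact_tail_probability(x, delta):
    """P(M <= M_observed | number-correct score) by complete enumeration."""
    k, r = len(x), sum(x)
    m_observed = m_statistic(x, delta)
    total = 0.0
    tail = 0.0
    for ones in itertools.combinations(range(k), r):
        pattern = [1 if g in ones else 0 for g in range(k)]
        m = m_statistic(pattern, delta)
        weight = math.exp(m)  # conditional probability up to a constant
        total += weight
        if m <= m_observed + 1e-12:  # small tolerance for floating-point ties
            tail += weight
    return tail / total


deltas = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical item locations
print(exact_tail_probability([1, 1, 0, 0, 0], deltas))  # two easiest items correct
print(exact_tail_probability([0, 0, 0, 1, 1], deltas))  # two hardest items correct
```

Small values of $M$ correspond to correct answers on difficult items, so the lower tail is 
used. The number of patterns grows combinatorially with test length, which is why simulation 
or moment-based approximations are needed for longer tests. 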

Drasgow, Levine, and McLaughlin (1991) proposed a generalization $l_{zm}$ of the $l_z$ 
statistic for tests consisting of S unidimensional subtests or testlets. This statistic has a 
similar expression as $l_z$, but now the expectation and variance are taken over S subtests: 

$$l_{zm} = \frac{\sum_{s=1}^{S} \left[l^{(s)} - E(l^{(s)})\right]}{\left\{\sum_{s=1}^{S} Var(l^{(s)})\right\}^{1/2}}. \tag{23}$$

Although Drasgow et al. (1991) showed that $l_{zm}$ was effective in detecting aberrant item 
score patterns, detection rates were approximately equal to those for long unidimensional 
tests with a number of items equaling the total number of items in the S testlets. In 
practical test situations, the use of $l_{zm}$ suffers from the same problems as $l_z$: using $\hat{\theta}$ 
instead of $\theta$ will result in inappropriate approximations to probabilities of exceedance. 
Using $l_z$ in the context of the 3-PLM, Nering (1995) found that the empirical Type I error 
in general was lower than the nominal Type I error. 

Snijders (1998) derived the asymptotic sampling distribution for a group of person-fit 
statistics that all have the form given in Equation (11) and for which the maximum 
likelihood estimate $\hat{\theta}$ was used instead of $\theta$. It can easily be shown (Snijders, 1998) that 
$l - E(l)$ can be written in the form of Equation (11) choosing 

$$w_g(\theta) = \ln\left(\frac{P_g(\theta)}{1 - P_g(\theta)}\right).$$

Snijders (1998) derived expressions for the first two moments of the distribution, 
$E[V(\hat{\theta})]$ and $Var[V(\hat{\theta})]$, and performed a simulation study for relatively small tests 
consisting of 8 and 15 items, fitting the 2-PLM, and using maximum likelihood estimation 
for $\theta$. The results showed that the approximation was satisfactory for $\alpha = 0.05$ and 
$\alpha = 0.10$, but that the empirical Type I error was higher than the nominal Type I error 
for smaller values of $\alpha$. 
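
The step from the log-likelihood to the centered form of Equation (11) can be verified 
directly. The short derivation below is added as a check; it uses only the log-likelihood of 
Equation (18) and the expectation of Equation (20), and writes $P_g$ for $P_g(\theta)$. 

```latex
\begin{aligned}
l - E(l)
  &= \sum_{g=1}^{k} \bigl\{ X_g \ln P_g + (1 - X_g)\ln(1 - P_g) \bigr\}
   - \sum_{g=1}^{k} \bigl\{ P_g \ln P_g + (1 - P_g)\ln(1 - P_g) \bigr\} \\
  &= \sum_{g=1}^{k} (X_g - P_g)\ln P_g \;-\; \sum_{g=1}^{k} (X_g - P_g)\ln(1 - P_g) \\
  &= \sum_{g=1}^{k} \bigl[ X_g - P_g \bigr] \ln\frac{P_g}{1 - P_g}.
\end{aligned}
```

The last line has the form $\sum_g [X_g - P_g(\theta)]\,w_g(\theta)$ with the logit of $P_g(\theta)$ as weight. 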

Drasgow, Levine, & McLaughlin (1987) proposed two fit statistics that are sensitive 
to the flatness of the likelihood function. The idea was that when there is no single value 
of $\theta$ that provides a good fit for an item score pattern, the likelihood function will be 
relatively flat. The first statistic ($JK$) is a normalized jackknife variance estimate. Let 
$\hat{\theta}$ denote the 3-PLM maximum likelihood estimate of $\theta$ based on all k items in the test, 
and let $\hat{\theta}_{(g)}$ denote the estimate based on the k - 1 items remaining when item g is 
excluded. Then 

$$\hat{\theta}^*_g = k\hat{\theta} - (k - 1)\hat{\theta}_{(g)}, \quad \text{for } g = 1, \ldots, k,$$

and the jackknife estimate of $\theta$ is 

$$\hat{\theta}^* = \frac{1}{k} \sum_{g=1}^{k} \hat{\theta}^*_g,$$

with variance 

$$Var(\hat{\theta}^*) = \frac{1}{k(k - 1)} \sum_{g=1}^{k} \left(\hat{\theta}^*_g - \hat{\theta}^*\right)^2.$$

Because there is more Fisher information about $\theta$ in some ranges of $\theta$ than in other ranges 
of $\theta$, $Var(\hat{\theta}^*)$ depends on $\theta$. Therefore, $Var(\hat{\theta}^*)$ was weighted by the Fisher 
information, $I(\theta)$, of which the reciprocal is the asymptotic variance of $\hat{\theta}$, which resulted in 

$$JK = Var(\hat{\theta}^*)\, I(\hat{\theta}). \tag{24}$$
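
The jackknife steps can be sketched in code. The sketch below makes several assumptions that 
are not in the text: it uses the Rasch model instead of the 3-PLM to keep the estimation step 
short, it uses the standard jackknife variance estimator, and all function names and item 
locations are hypothetical. 

```python
import math


def rasch_p(theta, delta):
    """Rasch probability of a correct response (assumed model for this sketch)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def rasch_mle(x, delta, iterations=100):
    """Newton-Raphson ML estimate of theta; undefined for zero or perfect scores."""
    r = sum(x)
    if r == 0 or r == len(x):
        raise ValueError("no finite ML estimate for a zero or perfect score")
    theta = 0.0
    for _ in range(iterations):
        p = [rasch_p(theta, d) for d in delta]
        info = sum(pg * (1.0 - pg) for pg in p)
        step = (r - sum(p)) / info
        theta += max(-1.0, min(1.0, step))  # clamp the step for stability
    return theta


def jackknife_jk(x, delta):
    """Return (jackknife estimate, its variance, JK = variance * information)."""
    k = len(x)
    theta_hat = rasch_mle(x, delta)
    pseudo = []
    for g in range(k):
        # Re-estimate theta with item g left out, then form the pseudo-value.
        theta_minus_g = rasch_mle(x[:g] + x[g + 1:], delta[:g] + delta[g + 1:])
        pseudo.append(k * theta_hat - (k - 1) * theta_minus_g)
    theta_star = sum(pseudo) / k
    variance = sum((t - theta_star) ** 2 for t in pseudo) / (k * (k - 1))
    info = sum(rasch_p(theta_hat, d) * (1.0 - rasch_p(theta_hat, d)) for d in delta)
    return theta_star, variance, variance * info


deltas = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
x = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
print(jackknife_jk(x, deltas))
```

Every leave-one-out score vector must still contain at least one correct and one incorrect 
response, otherwise the maximum likelihood estimate does not exist. 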

The second person-fit statistic was the ratio of the observed and the expected 
information: 

$$O/E = \frac{-\left.\partial^2 l / \partial\theta^2\right|_{\theta = \hat{\theta}}}{I(\hat{\theta})}. \tag{25}$$

The idea behind this statistic was that if the log-likelihood $l$ (see Equation 18) is flatter 
for aberrant responses than for normal responses, then the observed information is expected 
to be smaller than the expected information. 

Drasgow et al. (1987) also proposed to use the variance of the mean number-correct 
score of examinees who selected option d of item g (item-option variance: IOV): 

$$IOV = Var(X_{dg}). \tag{26}$$

Conditional on $\theta$, large values of IOV point at deviant behavior. 

Statistics Based on the Caution Index 

Tatsuoka and Linn (1983) derived several statistics which were similar to the caution index 
$C_i$ discussed by Harnisch and Linn (1981) and which were adapted to IRT modeling. Let 
$\mathbf{X}_i$ be the vector of item scores of examinee $i$; let $\mathbf{X}_i^*$ be the theoretical Guttman vector, 
and let $\mathbf{n}$ be the vector with the item number-correct scores across examinees. The caution 
index can be written as 

$$C_i = 1 - \frac{Cov(\mathbf{X}_i, \mathbf{n})}{Cov(\mathbf{X}_i^*, \mathbf{n})}. \tag{27}$$

Also, let $\mathbf{P}(\theta)$ be the vector with conditional probabilities $P_g(\theta)$ across items. A vector 
$\mathbf{P}(\theta)$ is defined for each $\theta$. By norming against the covariance between the probability of 
a correct response under an IRT model and vector $\mathbf{n}$, statistic ECI1 was obtained as 

$$ECI1 = 1 - \frac{Cov(\mathbf{X}_i, \mathbf{n})}{Cov[\mathbf{P}(\theta), \mathbf{n}]}. \tag{28}$$

Similarly, ECI2 (and, likewise, ECI3) was obtained by taking the covariance (correlation) 
between an item score vector and the vector with the mean probability of correctly 
answering an item across $n$ examinees, $\mathbf{G} = (G_1, \ldots, G_k)$ with elements 
$G_g = n^{-1} \sum_{i=1}^{n} P_g(\theta_i)$: 

$$ECI2 = 1 - \frac{Cov[\mathbf{X}_i, \mathbf{G}]}{Cov[\mathbf{P}(\theta), \mathbf{G}]}, \tag{29}$$

and 

$$ECI3 = 1 - \frac{Corr[\mathbf{X}_i, \mathbf{G}]}{Corr[\mathbf{P}(\theta), \mathbf{G}]}. \tag{30}$$

ECI4, ECI5, and ECI6 were obtained by taking the covariance or the correlation 
between the response vector $\mathbf{X}_i$ and $\mathbf{P}(\theta)$, resulting in the following statistics: 

$$ECI4 = 1 - \frac{Cov[\mathbf{X}_i, \mathbf{P}(\theta)]}{Cov[\mathbf{G}, \mathbf{P}(\theta)]}, \tag{31}$$

$$ECI5 = 1 - \frac{Corr[\mathbf{X}_i, \mathbf{P}(\theta)]}{Corr[\mathbf{G}, \mathbf{P}(\theta)]}, \tag{32}$$

and 

$$ECI6 = 1 - \frac{Cov[\mathbf{X}_i, \mathbf{P}(\theta)]}{Var[\mathbf{P}(\theta)]}. \tag{33}$$

An important difference between these statistics is that ECI2 and ECI3 compare an 
individual item score pattern with the mean probability across persons and thus compare 
an individual item score pattern with group characteristics, whereas ECI4, ECI5, and 
ECI6 compare an individual item score pattern with the expected probability on the basis 
of a model. ECI4 is normed against the mean probability across items and ECI6 is 
normed against the variance of $\mathbf{P}(\theta)$. ECI3 and ECI5 are similar to ECI2 and ECI4, 
with the difference that in ECI3 and ECI5 correlations are used instead of covariances. 
Tatsuoka (1984) derived the expectations and the variances of ECI1, ECI2, ECI4, and 
ECI6 and used these to obtain standardized versions of these indices (subtracting the 
expected values and dividing by the standard errors). These standardized indices were 
denoted $ECI1_z$, $ECI2_z$, $ECI4_z$, and $ECI6_z$. 
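
The covariance-based indices share one computational pattern, which the following sketch 
makes explicit. It is an illustration under stated assumptions: each index is taken to be 
one minus a ratio of covariances, the function names are hypothetical, and the vectors for 
the probabilities, group means, and number-correct scores are toy values rather than data 
from any of the cited studies. 

```python
def covariance(a, b):
    """Population covariance of two equally long vectors (taken across items)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / n


def eci_indices(x, p, g_mean, n_correct):
    """Covariance-based extended caution indices for one item score vector x."""
    return {
        "ECI1": 1.0 - covariance(x, n_correct) / covariance(p, n_correct),
        "ECI2": 1.0 - covariance(x, g_mean) / covariance(p, g_mean),
        "ECI4": 1.0 - covariance(x, p) / covariance(g_mean, p),
        "ECI6": 1.0 - covariance(x, p) / covariance(p, p),
    }


p = [0.9, 0.7, 0.5, 0.3, 0.1]            # P_g(theta) for one examinee (toy values)
g_mean = [0.85, 0.65, 0.5, 0.35, 0.15]   # mean probabilities across examinees (toy values)
n_correct = [90, 70, 50, 30, 10]         # item number-correct scores (toy values)

print(eci_indices([1, 1, 1, 0, 0], p, g_mean, n_correct))  # Guttman-like pattern
print(eci_indices([0, 0, 1, 1, 1], p, g_mean, n_correct))  # reversed pattern
```

In this toy example the reversed pattern yields much larger index values than the 
Guttman-like pattern, which is the direction in which these indices signal misfit. 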

Although it has sometimes been remarked that likelihood statistics and the ECI 
statistics are based on different approaches to person fit (e.g., Harnisch & Tatsuoka, 1983; 
Kogut, 1986; Nering, 1997), it can be shown that both approaches are of the form of 
Equation (11). For example, the centered form of ECI4, that is, $ECI4 - E(ECI4)$, can 
be obtained by choosing 

$$w_g(\theta) = P_g(\theta) - \bar{P}(\theta),$$

where $\bar{P}(\theta) = k^{-1} \sum_{g=1}^{k} P_g(\theta)$, and the centered form of ECI2 can be obtained by 
choosing 

$$w_g(\theta) = G_g - \bar{G},$$

where $\bar{G} = k^{-1} \sum_{g=1}^{k} G_g$. 

Optimal Person-Fit Statistics 

Levine and Drasgow (1988; see also Drasgow & Levine, 1986; Drasgow, Levine, & 
Zickar, 1996) proposed a method for the identification of aberrant item score patterns 
which was statistically optimal; that is, no other method can achieve a higher rate of 
detection at the same Type I error rate. A likelihood ratio statistic was determined which 
provided the most powerful test for the null hypothesis that an item score pattern is 
normal versus the alternative hypothesis that it is aberrant. The researcher in advance has 
to specify a model for normal behavior (e.g., the 1-, 2-, or 3-PLM) and a model that 
specifies a particular type of aberrant behavior (e.g., a model in which violations of local 
independence are specified). The likelihood ratio statistic 

$$\lambda(\mathbf{X}) = \frac{P_{aberrant}(\mathbf{X} = \mathbf{x})}{P_{normal}(\mathbf{X} = \mathbf{x})} \tag{34}$$

is calculated, and those patterns are classified as aberrant (1) which have the largest 
$\lambda(\mathbf{X})$ and (2) whose likelihoods under the model describing normal response behavior 
sum up to the $\alpha$ level. 

Klauer (1991, 1995) investigated aberrant item score patterns by testing a null model 
of normal response behavior (Rasch model) against an alternative model of aberrant 
response behavior. Writing the Rasch model as a member of the exponential family, 

$$P(\mathbf{X} = \mathbf{x} \mid \theta) = \mu(\theta)\, h(\mathbf{x}) \exp[\theta R(\mathbf{x})], \tag{35}$$

where 

$$\mu(\theta) = \prod_{g=1}^{k} [1 + \exp(\theta - \delta_g)]^{-1}, \qquad h(\mathbf{x}) = \exp\left(-\sum_{g=1}^{k} \delta_g x_g\right),$$

and $R(\mathbf{x})$ = number-correct score, Klauer (1995) modelled aberrant response behavior 
using the two-parameter exponential family, and introducing an extra person parameter 
$\eta$, as 

$$P(\mathbf{X} = \mathbf{x} \mid \theta, \eta) = \mu(\theta, \eta)\, h(\mathbf{x}) \exp[\eta T(\mathbf{x}) + \theta R(\mathbf{x})], \tag{36}$$

where $T(\mathbf{x})$ depends on the particular alternative model considered. Using the exponential 
family of models, a uniformly most powerful test (e.g., Lindgren, 1993, p. 350) can 
be used for testing $H_0$: $\eta = \eta_0$ against $H_1$: $\eta \neq \eta_0$. Let a test be subdivided into two 
subtests $A_1$ and $A_2$; then, as an example of $\eta$, $\eta = \theta_1 - \theta_2$ was considered, where $\theta_1$ is an 
individual’s ability on subtest $A_1$ and $\theta_2$ is an individual’s ability on subtest $A_2$. Under the 
Rasch model, it is expected that $\theta$ is invariant across subtests and thus $H_0$: $\eta = 0$ can be 
tested against $H_1$: $\eta \neq 0$. For this type of aberrant behavior $T(\mathbf{x})$ is the number-correct 
score on either one of the subtests. 

Klauer (1995) also tested $H_0$ of equal item discrimination parameters for all persons 
against person-specific item discrimination, and $H_0$ of local independence against 
violations of local independence. Results showed that the power of these tests depended 
on the type and severity of the violations. Violations of invariant ability ($H_0$: 
$\eta = 0$) were found to be the most difficult to detect. Liou (1993) discussed refinements 
using these types of tests. 

Interesting in both the Levine and Drasgow (1988) and Klauer (1991, 1995) ap- 
proaches is that model violations are specified in advance and that tests are proposed 
to investigate these model violations. This is different from the approach followed in 
most person-fit studies where the alternative hypothesis simply says that the null hypoth- 
esis is not true. An obvious problem is which alternative models to specify. A possibility 
is to specify a number of plausible alternative models and then successively test model- 
conform item score patterns against these alternative models. Another option is to first 
investigate which model violations are most detrimental to the use of the test envisaged 
and then test against the most serious violations (Klauer, 1995). 

The Person-Response Function 

Trabin and Weiss (1983; see also Weiss, 1973) proposed to use the person response 
function (PRF) to identify aberrant item score patterns. At a fixed $\theta$ value, the PRF specifies 
the probability of a correct response as a function of the item location $\delta$. In IRT, the 
item response function often is assumed to be a nondecreasing function of $\theta$, whereas the 
PRF is assumed to be a nonincreasing function of $\delta$ (Trabin & Weiss, 1983). To construct 
an observed PRF, Trabin and Weiss (1983) ordered items by increasing $\delta$ values and 
then formed subtests of items by grouping items according to $\delta$ values. For fixed $\theta$, the 
observed PRF was constructed by determining, in each subtest, the mean probability of 
a correct response. The expected PRF was constructed by estimating according to the 
3-PLM, in each subtest, the mean probability of a correct response. A large difference 
between the expected and observed PRFs was interpreted as an indication of nonfitting 
responses for that examinee. 

Let k items be ordered by their $\delta$ values and let item rank numbers be assigned 
accordingly, such that 

$$\delta_1 < \delta_2 < \ldots < \delta_k. \tag{37}$$

Furthermore, let $\hat{\theta}$ be the maximum likelihood estimate of $\theta$ under the 3-PLM. Assume 
that S ordered subtests $A_s$ ($s = 1, \ldots, S$) can be formed, each containing m items; thus, 
$A_1 = \{1, \ldots, m\}$, $A_2 = \{m + 1, \ldots, 2m\}$, ..., $A_S = \{k - m + 1, \ldots, k\}$, and note that 
$S \cdot m = k$. To construct the expected PRF, an estimate of the expected proportion of 
correct responses on the basis of the 3-PLM in each subtest is taken: 

$$m^{-1} \sum_{g \in A_s} P_g(\hat{\theta}), \quad \text{for } s = 1, 2, \ldots, S.$$

This expected proportion is compared with the observed proportion of correct responses, 
given by 

$$m^{-1} \sum_{g \in A_s} X_g, \quad \text{for } s = 1, 2, \ldots, S.$$

Within each subtest, for a particular $\hat{\theta}$ the difference of observed and expected correct 
scores is taken and this difference is divided by the number of items in the subtest. This 
yields 

$$D_s(\hat{\theta}) = m^{-1} \sum_{g \in A_s} [X_g - P_g(\hat{\theta})], \quad \text{for all } s = 1, \ldots, S. \tag{38}$$

Next, the $D_s$ are added across subtests, which yields 

$$D(\hat{\theta}) = \sum_{s=1}^{S} D_s(\hat{\theta}). \tag{39}$$

$D(\hat{\theta})$ was taken as a measure of an individual’s fit to the model. For example, when an 
examinee copied the answers on the most difficult items, the scores of such an examinee on 
the most difficult subtests are likely to be substantially higher than suggested by the 
expected PRF. Related ideas were discussed by Lumsden (1977, 1978). 
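
The comparison of observed and expected person response functions can be sketched as 
follows. The function name and the toy values are hypothetical, and the model probabilities 
are supplied directly instead of being estimated under the 3-PLM. 

```python
def prf_discrepancy(x, p, m):
    """Return ([D_1, ..., D_S], D) for items ordered by increasing difficulty.

    x: binary item scores, p: model probabilities at the estimated trait value,
    m: number of items per subtest (the test length must be a multiple of m).
    """
    k = len(x)
    if k % m != 0:
        raise ValueError("test length must be a multiple of the subtest size")
    per_subtest = []
    for start in range(0, k, m):
        items = range(start, start + m)
        # Observed minus expected proportion correct in this subtest.
        per_subtest.append(sum(x[g] - p[g] for g in items) / m)
    return per_subtest, sum(per_subtest)


# Toy example: the two easiest items are failed and the two hardest are passed.
d_s, d_total = prf_discrepancy([0, 0, 1, 1], [0.8, 0.6, 0.4, 0.2], m=2)
print(d_s, d_total)
```

In this toy case the subtest discrepancies are equally large with opposite signs, so the 
per-subtest values carry the diagnostic information while their signed sum is zero. 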

Klauer and Rettig (1990) expanded the methodology of Trabin and Weiss (1983) by 
proposing three person-fit statistics that were standardized and for long tests 
asymptotically followed a chi-square distribution. Let a test be divided into S subtests 
$A_s$ with $s = 1, \ldots, S$, and let the latent trait estimate for the total test (k items) be 
denoted by $\hat{\theta}$. One of the statistics was 

$$\chi^2_{sc} = \sum_{s=1}^{S} \frac{V_s^2(\hat{\theta})}{I_s(\hat{\theta})}, \tag{40}$$

where $V_s(\hat{\theta})$ is of the form given in Equation (11), 

$$V_s(\hat{\theta}) = \sum_{g \in A_s} [X_g - P_g(\hat{\theta})]\, w_g(\hat{\theta}), \tag{41}$$

with 

$$w_g(\theta) = \frac{dP_g(\theta)/d\theta}{P_g(\theta)[1 - P_g(\theta)]}, \tag{42}$$

and $I_s(\hat{\theta})$ is the estimated Fisher information function. To test whether $\theta$ is invariant 
across subtests, the null hypothesis $H_0$: $\theta_1 = \theta_2 = \ldots = \theta_S$ is tested. Under $H_0$, $\chi^2_{sc}$ has 
a chi-square distribution with $df = S - 1$. Note that this test is similar to the method 
proposed by Trabin and Weiss (1983) but differs in that $\chi^2_{sc}$ is standardized and 
asymptotically chi-square distributed. Klauer and Rettig (1990) also proposed two related tests. 
The first was the Wald test, which directly compares a person’s ability estimates obtained 
from different subtests. The other was a likelihood ratio test. By means of Monte Carlo 
research Klauer and Rettig (1990) showed that the chi-square distribution of $\chi^2_{sc}$ was 
appropriate for tests of at least 80 items. For the Wald test and the likelihood ratio test the 
difference between the theoretical and empirical chi-square distributions was too large to 
be of practical use. 

Research Using IRT-Based Person-Fit Statistics 

Several studies have addressed the usefulness of IRT-based person-fit statistics. In most 
studies simulated data were used, and in some studies, empirical data were used. We will 
distinguish studies investigating the 

1. detection rate of fit statistics and comparing fit statistics with respect to several criteria 
such as distributional characteristics and relation to the total score; 

2. influence of item, test, and person characteristics on the detection rate; 

3. relation between nonfitting score patterns and the validity of test scores; and 

4. applicability of person-fit statistics to detect particular types of nonfitting item score 
patterns. 

Although some studies may be categorized under more than one heading, we discuss 
each study under the heading where it seems to make its largest contribution. 

Studies Investigating Detection Rate and Comparing Fit-Statistics 

Levine and Rubin (1979) evaluated statistic $l$ in a study that simulated item score vectors 
using item parameters estimated from the Scholastic Aptitude Test (Verbal). Spuriously 
high-scoring examinees were simulated by randomly sampling a fixed percentage of the 
item scores of normal examinees (generated using the 3-PLM) and changing these scores 
into 1 scores. Spuriously low-scoring examinees were simulated by sampling a fixed 
percentage of the item scores and rescoring the items as correct with a probability of 
0.20. The percentages of item scores that were taken were 4, 10, 20, and 40. Levine 
and Rubin (1979) found that the larger the number of aberrant item scores, the better $l$ 
could distinguish normal from aberrant score patterns. They also found that spuriously 
high-scoring simulees were easier to detect than spuriously low-scoring simulees. This could 
be understood because more item scores were changed for the spuriously high-scoring 
examinees. 
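
The manipulation used to create spuriously high and spuriously low item score vectors can 
be sketched as follows. This is a hedged illustration of the procedure as described above, not 
the original simulation code: the function names, the 3-PLM item parameters, and the random 
seed are all hypothetical. 

```python
import math
import random


def prob_3pl(theta, a, b, c):
    """3-PLM probability of a correct response (a: slope, b: location, c: guessing)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def simulate_normal(theta, items, rng):
    """Generate a model-fitting item score vector."""
    return [1 if rng.random() < prob_3pl(theta, a, b, c) else 0 for a, b, c in items]


def make_spuriously_high(x, fraction, rng):
    """Set a randomly chosen fraction of the item scores to 1."""
    chosen = rng.sample(range(len(x)), round(fraction * len(x)))
    y = list(x)
    for g in chosen:
        y[g] = 1
    return y


def make_spuriously_low(x, fraction, rng, p_correct=0.20):
    """Rescore a randomly chosen fraction of the items as correct with probability p_correct."""
    chosen = rng.sample(range(len(x)), round(fraction * len(x)))
    y = list(x)
    for g in chosen:
        y[g] = 1 if rng.random() < p_correct else 0
    return y


rng = random.Random(2024)
items = [(1.0, -2.0 + 0.1 * g, 0.2) for g in range(40)]  # hypothetical item parameters
normal = simulate_normal(0.0, items, rng)
high = make_spuriously_high(normal, 0.20, rng)
low = make_spuriously_low(normal, 0.20, rng)
print(sum(normal), sum(high), sum(low))
```

For a low-ability simulee most sampled scores are 0 and are therefore changed by the 
spuriously high manipulation, whereas the spuriously low manipulation leaves some sampled 
scores unchanged by chance. 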

In another study using $l$, Drasgow (1982) compared the detection rates of $l$ using 
either the Rasch model or the 3-PLM to describe the data of the Graduate Record 
Examination. He found higher detection rates for examinees with spuriously low manipulated 
item scores than for examinees with spuriously high manipulated item scores. Furthermore, 
detection rates for this dataset were higher using the 3-PLM than using the Rasch 
model. 

Harnisch and Tatsuoka (1983) used National Assessment of Educational Progress 
(NAEP) data on mathematics to investigate the correlations, scatterplots, and tests for 
curvilinearity for ECIa (a = 1, 2, ..., 5), $ECI1_z$, $ECI2_z$, $ECI4_z$, U, W, $l$, and $l_z$. U was 
used under the 2-PLM and the 3-PLM and $l$ was used under the 3-PLM and under the 
Normal-Ogive model (Hambleton & Swaminathan, 1985, pp. 35-36). Harnisch and 
Tatsuoka (1983) found that ECI1, ECI2, and ECI4 had standard deviations of 
approximately 1 and means of approximately .20. U correlated lowest with the other indices 
(approximately .10), and the other indices correlated between .50 and .98 with each other. 
$l_z$ and $l$ correlated highest with the total score: .36 and .27, respectively. Furthermore, they 
found the strongest curvilinear relationship between the total score and $l$ and between the 
total score and W. 

Drasgow, Levine, & McLaughlin (1987) used the 3-PLM to compare the person-fit 
statistics $l_z$, ZU, ZW, C, IOV, JK, O/E, $ECI2_z$, and $ECI4_z$ with respect to (1) their 
performance relative to optimal statistics, (2) their standardization (i.e., whether the 
distribution of the statistic was comparable across $\theta$), and (3) their detection rate. To 
determine the detection rate, nonfitting item scores were simulated in a similar way as in 
the Levine and Rubin (1979) study. Detection rates were found by determining the 
proportions of aberrant item score patterns that were correctly identified as aberrant when 
various proportions of normal item score patterns were misclassified as aberrant. Drasgow 
et al. (1987) concluded that ZU, C, and IOV were poorly standardized compared to the other 
statistics. They also found that $ECI4_z$ was better standardized and had a higher detection 
rate than $ECI2_z$. Furthermore, it was found that the O/E and the JK statistics were 
reasonably well standardized, but that these statistics were quite ineffective for the 
detection of aberrant item score patterns. One of the most interesting results was that $l_z$, 
ZW, and $ECI4_z$ had high detection rates for spuriously high scoring examinees having low 
$\theta$ values and for low scoring examinees having high $\theta$ values (e.g., the detection rate 
of $ECI4_z$ was .75 for low $\theta$ values at a Type I error of .01 and 30% spuriously high scoring 
examinees). However, these statistics were less sensitive to manipulated response patterns 
for $\theta$ values around the mean $\theta$ (for the example given above, the detection rate 
decreased from .75 for low $\theta$ values to .51 at a Type I error of .01). Optimal indices had 
detection rates from 50% to 200% higher than other indices for average-ability examinees 
and item score patterns with spuriously high or spuriously low test scores. 

Rogers and Hattie (1987) investigated the detection rate of ZU and ZW. Transforma- 
tions of both statistics were claimed (Wright & Stone, 1979) to be asymptotically standard 
normally distributed. Rogers and Hattie (1987) determined the detection rate of ZU and 
ZW using theoretical critical values for guessing, heterogeneity of the discrimination 
parameters, and multidimensionality. They concluded that ZW was insensitive to het- 
erogeneity of the discrimination parameters and to multidimensionality and sensitive to 
guessing; ZU was insensitive to guessing, heterogeneity of the discrimination parameters 
and to multidimensionality. Detection rates increased by no more than 2% compared to 
normally responding examinees. 

Noonan, Boss, and Gessaroli (1992) investigated the distributional characteristics and 
the empirical critical values of $l_z$, $ECI4_z$, and ZW as a function of the test length and 
the IRT model (2-PLM and 3-PLM). They found that both $l_z$ and $ECI4_z$ had means and 
standard deviations (SD) that approximated the standard normal distribution. However, 
ZW had a mean over replications of approximately 1.00 but a SD between .144 and 
.232. Furthermore, they found that $ECI4_z$ and ZW were positively skewed and that $l_z$ 
was negatively skewed, whereas the skewness of $ECI4_z$ was half the skewness of the 
other two statistics. They also found that for all three statistics, the critical values were 
affected by test length and IRT model. The critical values of ZW were most affected. 
They concluded that $ECI4_z$ best approximated the normal distribution and, moreover, 
was less affected by test length and the IRT model. $l_z$ and $ECI4_z$ were highly correlated 
(.95), whereas $ECI4_z$ and ZW had the smallest correlation (.58). However, true $\theta$ values 
were used, which makes the generalization to empirical distributions difficult. 

Li and Olejnik (1997) compared the distribution of five person-fit statistics that were 
normally distributed assuming the Rasch model: $l_z$, $ECI2_z$, $ECI4_z$, ZU, and ZW. They 
found that (1) the statistics had low correlation with the total score; (2) the statistics were 
positively skewed and deviated significantly from normality, where $ECI4_z$ was better 
normalized than $ECI2_z$; (3) $l_z$ performed at least as well as the other statistics in detecting 
aberrant behavior; (4) examinees with spuriously low and spuriously high total scores 
were equally well detectable when unidimensional data were used, whereas detection 
rates of spuriously low total scores were lower than detection rates of spuriously high total 
scores when a multidimensional test was used; and (5) person-fit statistics were not very 
powerful in identifying aberrant item score patterns; $l_z$ was most powerful and detected at 
most 67% of the aberrant item score patterns. However, because it was assumed that the 
true $\theta$ equaled the maximum likelihood estimate $\hat{\theta}$, these conclusions suffer from the same 
shortcomings as the earlier work by Drasgow, Levine, and colleagues. As was discussed 
above, when using the Rasch model as Li and Olejnik did, it is better to condition on the 
total score, so that the null distribution is independent of $\theta$, and to use statistic $M$. This 
was done by Kogut (1987), who used simulated Rasch model data to show that the detection 
rate of statistic $M$ for detecting aberrant item score patterns was higher than that of $l_z$. 

Trabin and Weiss (1983) applied the PRF approach to a 216-item vocabulary test which 
had been administered to 151 graduate students. To investigate whether the responses 
were in agreement with the 3-PLM, they used $D(\hat{\theta})$ for evaluating for each student the 
discrepancy between the observed and the expected PRF and assumed that $D(\hat{\theta})$ was 
chi-square distributed. Some students had significant chi-squares but the cause of aberrance 
could not be explained. 

Nering and Meijer (1998) used simulated data for comparing the PRF approach with 
the $l_z$ statistic and found that in most cases the detection rate of $l_z$ was higher than that 
of the PRF method. They suggested that the PRF approach and $l_z$ can be used in a 
complementary way: aberrant item score patterns can be detected using $l_z$, and differences 
between expected and observed PRFs may be used to retrieve more information at the 
subtest level. 

Influence of Item, Person, and Test Characteristics 

Several simulation studies investigated the detection rate and the distributional 
characteristics of person-fit statistics as a function of person and test characteristics. In 
general, the detection rate was defined as the percentage of aberrant simulees classified as 
aberrant by a particular statistic. 

Levine and Drasgow (1982, 1983) investigated whether (1) the detection rate of $l$ was 
influenced by using estimated item parameters instead of true parameters, and (2) 
the presence of aberrant item score patterns influenced the item parameter estimates and 
the detection rate. Response vectors were simulated according to the 3-PLM using the 
estimated item parameters from a previous calibration study of the SAT (Verbal). 
Aberrant item score vectors were simulated by randomly selecting from each vector 20% of 
the item scores (0s and 1s) and changing these item scores with a probability of .20 (1s 
became 0s and 0s became 1s). They concluded that the detection rate of $l$ was not 
seriously affected by the estimated item parameters and by the presence of nonfitting item 
score patterns. Kogut (1987), however, concluded from his simulation study that, as a 
result of the presence of deviant item score patterns in the sample, the power of $l_z$ and $M$ 
was seriously reduced. Possible explanations for the different results of the two studies are 
the different statistics that were used and the different numbers of simulated item score 
patterns. In the Levine and Drasgow study, the percentage of nonfitting 
response vectors was 6.7, and in the Kogut study this percentage was 20. The higher 
percentage of nonfitting item score patterns may have reduced the power. Furthermore, the 
type of nonfitting item score vectors also may have been responsible for reduced power. 

Reise and Due (1991) found that longer tests and larger spread between the item 
difficulties resulted in higher detection rates for $l_z$. They simulated item scores with less 
Fisher information for estimating $\theta$ than predicted by the parameters of an IRT model; 
that is, item scores were simulated using different levels of the a-parameter (which is 
related to item information, Hambleton & Swaminathan, 1985, p. 105). Furthermore, 
they varied test length (7, 21, 35, and 49 items) and also varied the spread in the $\delta$s and 
the $\gamma$s. Reise and Due (1991) concluded that test length, spread of the $\delta$s, and the value 
of the $\gamma$s each affected the detection rate of $l_z$. They found that, in general, longer tests, 
larger spread of the $\delta$s, and low $\gamma$ values resulted in higher detection rates. Furthermore, 
they concluded that $l_z$ obtained its lowest detection rate for low $\theta$ values. 

Parsons (1983) investigated the effectiveness of a transformed version of $l$ to detect 
simulated aberrant item score patterns on a personality inventory, the Job Descriptive 
Index, that measured satisfaction with multiple facets of a job. Data were generated 
according to the 2-PLM using the estimated item parameters from an empirical calibration 
sample. Twenty out of the 60 items were selected and for these items, scores were 
generated with a probability of .30 of obtaining the correct response. Results indicated that 
higher detection rates were obtained at higher $\theta$s. This could be explained because for 
these simulees more item scores were changed. Furthermore, it was found that the 
variance of the total score for aberrant item score patterns was lower than for normal patterns. 
The explanation was that aberrant item scores are probably uncorrelated with each other 
and this reduced the variance of the total score compared to the total score on a set of 
items that are correlated. 

Smith (1985) compared robust estimators with person-fit statistics. Robust
estimators correct for unexpected responses by giving these responses less weight, so as to
obtain a representative latent trait estimate. Smith concluded that it is better to use
person-fit analysis because the robust estimators introduce a bias in the estimation of the
latent trait.

Reise (1995) investigated the detection rate of the $l_z$ person-fit statistic as a function of
using true $\theta$ and several estimates of $\theta$: maximum likelihood estimation (MLE), expected
a posteriori (EAP) estimation, and biweight (BIW) estimation. To estimate $\theta$, datasets
were simulated based on the estimated item parameters of four personality scales that fit
the 2-PLM. Reise found that using true $\theta$ consistently resulted in the highest detection
rate for $l_z$. The detection rate of $l_z$ differed between the three estimation methods, but
this difference depended on the type of test, the $\theta$ level, and the percentage of nonfitting
responses. Reise also found that BIW estimation typically resulted in a somewhat higher
detection rate than EAP and MLE. Meijer and Nering (1997) investigated the detection
rate of $l_z$ using MLE, EAP, and BIW and also the bias in $\hat\theta$ as a function of different
types of aberrant behavior. They found that the presence of aberrant item score patterns
influenced the bias in $\hat\theta$, and this depended heavily on the type of misfit and the $\theta$ level.
It was also found that the BIW scoring method reduced the bias in $\hat\theta$ and improved the
detection rate relative to MLE and EAP for examinees located at both extremes of the $\theta$
continuum.
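
For readers who want to experiment with the effect of using true or estimated $\theta$, the standardized log-likelihood statistic $l_z$ can be computed as sketched below. The sketch uses the usual definition of $l_z$ (the log-likelihood of the item score pattern, centered at its expectation and divided by its standard deviation under the model) and the 2-PLM. The function names and the argument names `a` and `delta` are hypothetical, and the trait value is simply passed in, so either a true or an estimated value can be supplied.

```python
import math


def p_2plm(theta, a, delta):
    """Probability of a correct (keyed) response under the 2-PLM."""
    return 1.0 / (1.0 + math.exp(-a * (theta - delta)))


def lz(scores, probs):
    """Standardized log-likelihood of a 0/1 item score pattern.

    scores: item scores (0 or 1); probs: model probabilities P_g(theta),
    each strictly between 0 and 1.
    """
    l0 = sum(x * math.log(p) + (1 - x) * math.log(1 - p)
             for x, p in zip(scores, probs))
    expected = sum(p * math.log(p) + (1 - p) * math.log(1 - p) for p in probs)
    variance = sum(p * (1 - p) * math.log(p / (1 - p)) ** 2 for p in probs)
    return (l0 - expected) / math.sqrt(variance)


# Illustration with five hypothetical items ordered from easy to difficult.
deltas = [-2.0, -1.0, 0.0, 1.0, 2.0]
probs = [p_2plm(0.0, 1.0, d) for d in deltas]
lz_consistent = lz([1, 1, 1, 0, 0], probs)  # easy items correct
lz_reversed = lz([0, 0, 1, 1, 1], probs)    # difficult items correct
```

Large negative values indicate item score patterns that are unlikely under the model; in this illustration the reversed pattern obtains a much lower value than the consistent one.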

Application of Person-Fit Statistics to Empirical Data 

Birenbaum (1985) compared the effectiveness of nine IRT-based statistics (ECI1, ECI2,
ECI4, their standardized versions, and $l$, $l_z$, and $U$) in distinguishing
among the following types of empirical item score patterns: item scores of an
uncooperative group, item scores of a cooperative group, and item scores of a group in which
the item scores had been randomly generated. The groups were distinguished from
each other on the basis of (1) motivation to take a test (rated by a test administrator) and
(2) whether the student wrote his/her name on the test answer sheet. The test was
administered only for research and development purposes. Except for statistic $U$, Birenbaum
(1985) found significant differences in the mean values of the statistics between the
three groups. The correlation between the standardized indices was high (.90). However,
$l$ and $U$ had a low correlation of .10. Most statistics had low correlations with the total
score (between .13 and .22). Curvilinearity between the person-fit statistics and the total
score was rejected for none of the three unstandardized ECIs. The largest curvilinearity was
detected for $l$, indicating that this index yielded the most inflated values at both extremes
of the ability scale.

Birenbaum (1986) investigated the relation between four person-fit statistics,
$\mathrm{ECI1}_z$, $\mathrm{ECI2}_z$, $\mathrm{ECI4}_z$, and $l_z$, on the one hand, and the scores on an anxiety
scale and a lie scale of the MMPI and on a general ability test on the other hand. It was
hypothesized that a sample of examinees with low anxiety scores but high lie scores has a
less appropriate item score pattern on an ability test than low-anxiety examinees who scored
low on the lie scale, because persons with high lie scores want to impress the
assessor by saying that they have low anxiety, but they cannot conceal the effect of their
anxiety on the cognitive reasoning test scores. Birenbaum (1986) found high correlations
between the four person-fit statistics (between .97 and .99). Furthermore, low correlations
were found between the scores on the lie scale and the fit statistics (.10) and between the
scores on the anxiety scale and the fit statistics (.14). Scores on the ability scale correlated
.50 with the fit indices. There was indeed a significant difference in the mean values
of the person-fit statistics between the two groups: examinees with low anxiety and
high lie scores were found to be more aberrant than examinees with low anxiety and low
lie scores.

Hoijtink (1987) investigated the effect of nonfitting item score patterns on the fit of
items to the Rasch model. The item score patterns came from two empirical datasets from a
questionnaire measuring neurological and ophthalmic skills of general practitioners.
Aberrant item score patterns were removed from the dataset, and it was investigated whether
this resulted in a better fit of nonfitting items to the Rasch model. To minimize the
danger of adapting the data to the model, item score patterns were removed only if
they were classified as aberrant under both the original and the improved
item estimates and if the fit of the dataset as a whole improved after removing the
nonfitting examinees. Hoijtink showed that removing nonfitting item score
patterns resulted for some items in a better fit to the model. However, in contrast to the
Birenbaum (1985) study, it could not be explained why some examinees answered the
questionnaires in a deviant way.

Rudner, Bracey, and Skaggs (1996) investigated the use of statistic W in the context 
of the 1990 NAEP Trial State Assessment. They found almost no examinees with extreme 
item score patterns. Eliminating examinees with the worst fit did not result in meaningful 
differences in the mean NAEP scale scores between trimmed and untrimmed data. 

Reise and Waller (1993) explored the use of $l_z$ in personality measurement by
analyzing empirical data of the Multidimensional Personality Questionnaire (Tellegen, 1982).
Three possible applications of person-fit measurement in the context of personality
research were discussed: detection of measurement error, detection of variation due to
faulty responding, and detection of variation due to inappropriateness of the personality
trait measured by the test for describing several examinees. Reise and Waller (1993)
noted that it is difficult to distinguish persons for whom the particular trait is inappropriate
from persons who misfit because of measurement error or faulty responding. To reduce the
chances that misfit of a person's item score pattern was attributable to measurement error or
faulty responding, they used unidimensional subscales and information from detection scales
that flag persons exhibiting inconsistent answering behavior. By means of $l_z$, persons could
be identified who did not respond according to the 2-PLM and who were not flagged by the
inconsistency scales. However, the accuracy of the classification could not be evaluated
because it was unknown which persons really behaved in an aberrant way.

Zickar and Drasgow (1996) analyzed a dataset from a personality test that consisted 
of item scores from examinees who had been instructed either to respond honestly to the 
test or to fake the answers to convey a favorable impression. They found that optimal 
person-fit statistics classified a higher number of faking respondents than did a social 
desirability scale. The detection rates, however, were low (mostly between .10 and .30).

Molenaar and Hoijtink (1996) investigated the use of statistic $M$ given in Equation
(22) in the context of the Rasch model for a test in which four- to seven-year-olds had
to indicate which of three pictures presented was consistent with the item. Each picture
consisted of a number of balls and stars that were colored white and black. They
simply identified patterns with a low probability of exceedance. For example, ordering
the items from easy to difficult, they identified on an 11-item test the response pattern
(00000010011), which had a significance probability of .002, and concluded that this
pattern was a candidate for closer inspection.
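
To see informally why a pattern such as (00000010011) stands out, one can count its Guttman errors, that is, the number of item pairs in which the easier item is answered incorrectly and the more difficult item correctly. This count is a simpler descriptive index and not statistic $M$; it does not reproduce the significance probability reported above. The function name is hypothetical.

```python
def guttman_errors(scores):
    """Count Guttman errors in a 0/1 pattern with items ordered from easy
    to difficult: pairs with an incorrect easier and a correct harder item."""
    errors = 0
    zeros_seen = 0
    for x in scores:
        if x == 0:
            zeros_seen += 1
        else:
            errors += zeros_seen  # every earlier 0 forms an error pair with this 1
    return errors


pattern = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1]
n_errors = guttman_errors(pattern)
```

For this pattern the count is 22, close to the maximum of 3 × 8 = 24 that is possible with three correct answers on 11 items.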

Validity of Test Scores and Aberrant Response Behavior 

The effect of deviant response behavior on validity and decision making was investigated
in several studies. The importance of the relation between deviant response behavior and
decision making was underlined by Drasgow and Guertler (1987). They argued that over-
or underestimating $\theta$ may have serious consequences. Overestimating $\theta$ may result in
selecting persons who are not able to do the job, and underestimating $\theta$ may be expensive
for the company because of the extra selection efforts that are needed. They presented a
utility theory approach to the use of person-fit statistics in practical settings. The approach
requires the distribution of a statistic in samples with normal and aberrant item score
patterns. On the basis of the probabilities of score patterns under these distributions, the
utility could be estimated and the critical value of a statistic could be determined in line
with the estimated utility.

Schmitt, Cortina, and Whitney (1993) investigated whether aberrant item score
patterns may distort both estimates of criterion-related validity and estimates of the
relationship between trait levels and performance constructs. Using $l_z$ and the 3-PLM, and
four empirical datasets, they found little or no improvement in the correlation between
the predictor and the criterion when aberrant item score patterns were removed from the
data. However, a hierarchical regression analysis in which the criterion scores were
regressed onto (1) the predictor scores, (2) group membership based on $l_z$ scores (normal
or aberrant), and (3) their cross products showed a significant interaction term for some
datasets, implying that $l_z$ scores may improve prediction. Meijer (1997) used simulated
data to investigate the relation between aberrant response behavior and the validity of test
scores. He concluded that nonfitting item score patterns can influence the validity of
a test if the type of misfit is severe, the correlation between the predictor and criterion
scores is .3 or .4, and the proportion of nonfitting item score patterns is relatively high
(.15 or higher). However, using $l_z$ to remove aberrant item score patterns from a
predictor test appeared to have little impact on the validity coefficient with the criterion
test. These results confirmed the results found by Schmitt et al. (1993) and can partly be
explained by the less than perfect detection rate; in the most favorable case approximately
40% of the aberrant item score patterns remained in the sample. Meijer (1998) used ZU3
to identify persons with unexpected item scores in empirical selection data. It was shown
that, in general, persons with inconsistent item scores are less predictable than persons
with consistent item scores. Both persons with lower criterion scores and persons with
higher criterion scores than predicted could be identified.

Statistics for Detecting Answer Copying 

Person-fit statistics can be used for identifying individuals with item score patterns that
are unlikely given the IRT model under consideration. Levine and Rubin (1979) and Hulin
et al. (1983) suggested that these statistics also can be used to detect answer copying.
However, methods designed especially for detecting answer copying also are available.
We will not discuss answer copying statistics extensively because
this was already done excellently by Frary (1993). We mention only two examples. Most
answer copying statistics directly compare the item score patterns of two examinees and
determine the number of item scores they have in common. Large proportions of similar
item scores suggest answer copying.

One of the most promising statistics in this research area is the $g_2$ statistic (Frary,
Tideman, & Watts, 1977). To calculate this statistic, a copier ($i$) and a source ($j$) have to
be specified in advance. Let $N_{ij}$ denote the number of items that are answered identically by
persons $i$ and $j$, and let $X_{gi} = x_{gi}$ and $X_{gj} = x_{gj}$ denote the realization of an item
score of person $i$ and an item score of person $j$, respectively. Then
$N_{ij} = \sum_{g=1}^{k} T(X_{gi} = X_{gj})$, where
$T = 1$ if $i$ and $j$ select the same alternative to item $g$, and $T = 0$ otherwise. Furthermore,
let $\mathbf{X}_j$ denote the score pattern of person $j$. Treating $\mathbf{X}_j$ as fixed, $g_2$ is given by

$$g_2 = \frac{N_{ij} - E(N_{ij} \mid \mathbf{X}_j)}{\left[\mathrm{Var}(N_{ij} \mid \mathbf{X}_j)\right]^{1/2}}, \tag{43}$$

with

$$E(N_{ij} \mid \mathbf{X}_j) = \sum_{g=1}^{k} P_i(X_{gi} = X_{gj} \mid \mathbf{X}_j) \tag{44}$$

and

$$\mathrm{Var}(N_{ij} \mid \mathbf{X}_j) = \sum_{g=1}^{k} P_i(X_{gi} = X_{gj} \mid \mathbf{X}_j)\left[1 - P_i(X_{gi} = X_{gj} \mid \mathbf{X}_j)\right]. \tag{45}$$

Note that in determining $E(N_{ij} \mid \mathbf{X}_j)$ and $\mathrm{Var}(N_{ij} \mid \mathbf{X}_j)$ it is assumed that local
independence holds given $\mathbf{X}_j$. Also note that a latent trait is not assumed. The obvious
problem here is how to determine $P_i(X_{gi} = X_{gj} \mid \mathbf{X}_j)$. Frary et al. (1977) estimated these
probabilities using estimates of $\pi_g$ and the distractor answering proportions, and the ratio
of the copier's number-correct score to the mean number-correct score for all examinees.
As Wollack (1997) pointed out, $g_2$ does not take the trait level of the copier into account
and, as a result, $g_2$ implicitly assumes that the item discriminations and the probabilities
of selecting a particular distractor are constant across examinees with different trait levels.
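
Once the match probabilities are available, the standardization used for $g_2$ is straightforward to compute. The sketch below takes these probabilities as given input and does not reproduce the estimation procedure of Frary et al. (1977); the function and argument names are hypothetical.

```python
import math


def g2_statistic(copier, source, match_probs):
    """Standardized number of identical answers of a copier and a source.

    copier, source: chosen alternatives per item (any comparable labels).
    match_probs: per item, the probability that the copier selects the
    alternative chosen by the source (assumed to be estimated elsewhere).
    """
    if not (len(copier) == len(source) == len(match_probs)):
        raise ValueError("all inputs must have one entry per item")
    n_ij = sum(1 for x_i, x_j in zip(copier, source) if x_i == x_j)
    expected = sum(match_probs)                       # sum of match probabilities
    variance = sum(p * (1 - p) for p in match_probs)  # sum of p(1 - p), local independence
    return (n_ij - expected) / math.sqrt(variance)
```

For example, with four items, a match probability of .5 for each item, and identical answers on all four items, the function returns (4 - 2) / 1 = 2.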

As an alternative, Wollack (1997) proposed a statistic, $\omega$, similar to $g_2$, but with the
probabilities associated with each response determined using the nominal response model
(Bock, 1972). This model specifies the probability of an examinee $i$ with trait level $\theta_i$
selecting alternative $u$ of item $g$. Treating $\mathbf{X}_j$ as fixed, the statistic

$$\omega = \frac{N_{ij} - E(N_{ij} \mid \theta_i, \mathbf{X}_j)}{\left[\mathrm{Var}(N_{ij} \mid \theta_i, \mathbf{X}_j)\right]^{1/2}}, \tag{46}$$

where

$$E(N_{ij} \mid \theta_i, \mathbf{X}_j) = \sum_{g=1}^{k} P(X_{gi} = X_{gj} \mid \theta_i, \mathbf{X}_j) \tag{47}$$

and

$$\mathrm{Var}(N_{ij} \mid \theta_i, \mathbf{X}_j) = \sum_{g=1}^{k} P(X_{gi} = X_{gj} \mid \theta_i, \mathbf{X}_j)\left[1 - P(X_{gi} = X_{gj} \mid \theta_i, \mathbf{X}_j)\right], \tag{48}$$

is assumed to be asymptotically normally distributed. Using simulated data, Wollack
(1997) found that the empirical Type I error rate was lower than the nominal Type I error
rate. For most cases, however, this empirical Type I error rate was in closer agreement
with the nominal Type I error rate than was the case for $g_2$. As a result, the detection rate
of $\omega$ also was higher than that of $g_2$.
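
A minimal sketch of how $\omega$ could be computed is given below. It assumes the usual multinomial-logit form of the nominal response model, in which the probability of selecting an alternative is proportional to the exponent of a slope times $\theta$ plus an intercept. The function names, the argument names, and the data layout are hypothetical, and the item parameters and the copier's trait level are taken as known.

```python
import math


def nominal_probs(theta, slopes, intercepts):
    """Option probabilities for one item under the nominal response model."""
    logits = [lam * theta + zeta for lam, zeta in zip(slopes, intercepts)]
    top = max(logits)                       # subtract the maximum for numerical stability
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def omega_statistic(copier, source, theta_copier, item_params):
    """Standardized number of matches, given the copier's trait level.

    copier, source: indices of the selected alternative per item.
    item_params: per item, a (slopes, intercepts) pair with one entry
    per alternative.
    """
    n_ij = 0
    expected = 0.0
    variance = 0.0
    for x_i, x_j, (slopes, intercepts) in zip(copier, source, item_params):
        # probability that the copier selects the alternative chosen by the source
        p = nominal_probs(theta_copier, slopes, intercepts)[x_j]
        n_ij += int(x_i == x_j)
        expected += p
        variance += p * (1 - p)
    return (n_ij - expected) / math.sqrt(variance)
```

In practice the trait level of the copier has to be estimated from the same item scores that are under suspicion, which is one of the complications of this approach.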

Because $g_2$ and $\omega$ were defined so as to detect a specific type of aberrant behavior, for
this purpose they are more powerful than general person-fit statistics. Some problems,
however, seem to justify more research in this area. For example, as Wollack (1997) noted,
when copying is involved, $\hat\theta_i$ is confounded with $\theta_j$ and, because $\hat\theta_i$ is taken as the
copier's trait level to determine the probabilities used in $\omega$, this statistic also may be
confounded. In fact, the copier completes the test using two different $\theta$ values, his/her own
and that of the source. Before answer copying is investigated, it may therefore be interesting
to test by means of a person-fit statistic how seriously the unidimensionality assumption
is violated; that is, the fit of a person to the model should be investigated first. When
an item score pattern fits the model, the researcher can have confidence in the validity
of $\hat\theta$ and its use in $\omega$; when an item score pattern does not fit the model, care should be
taken in interpreting $\omega$, because the interpretation of $\hat\theta$ may be ambiguous. Furthermore,
because the normality assumption is based on asymptotic sampling theory, $\omega$ seems to be
less suited for short tests (say, 40 items or fewer). Moreover, because the standardization of
$\omega$ is similar to the standardization of the person-fit statistic $l_z$, using $\hat\theta$ instead of $\theta$
also is likely to influence the distribution of $\omega$.



Discussion 



Which Statistic Should Be Used?

We discussed several methods that can be used to investigate the aberrance of individual
item score patterns under particular IRT models. The methods ranged from statistics for
testing whether an item score pattern is in agreement with the other patterns in the sample,
to methods for investigating whether persons have been copying the correct answers from
other examinees. Depending on the type of data and the problems envisaged, a researcher
may choose a particular statistic, although not all statistics have equally favorable
properties in a statistical sense. For example, for short tests and tests of moderate length (say,
10-60 items) and using the standard normal distribution, for most statistics the nominal
Type I error rate is not in agreement with the empirical Type I error rate because $\hat\theta$
rather than $\theta$ is used. Recently, Snijders (1998) proposed statistical theory for correcting
the bias caused by using the maximum likelihood estimate $\hat\theta$ rather than $\theta$. In general,
sound statistical methods have been derived for the Rasch model, but because this model
is rather restrictive for empirical data, the use of these statistics also is restricted.

In general, it may be wise to first investigate possible threats to the fit of individual
item score patterns before using a particular person-fit statistic. If one suspects that
answer copying is a realistic threat, one of the answer copying statistics can be used as
an alternative to a person-fit statistic. As another example, if violations of local
independence are expected, one of the methods proposed by Klauer (1991) may be used
instead of a general statistic such as the one proposed by Molenaar and Hoijtink (1990). Not
only are tests against a specific alternative more powerful than general statistics, but the type
of deviance also is easier to interpret. Statistics like the $M$ statistic proposed by Molenaar and
Hoijtink (1991) are helpful in situations in which the researcher has no idea which threats are
most important. Statistics like UB (Equation 16) and UW (Equation 17) or the
person-response function can be used as diagnostic tools to test whether item score patterns
on a priori specified subtests fit the IRT model.

A drawback of some person-fit statistics is that only deviations from the model are
tested. This may result in interpretation problems. For example, item score patterns not
fitting the Rasch model may be described more appropriately by means of the 3-PLM
or may be flagged by a statistic for answer copying. If the Rasch model does not fit
the data, other explanations are possible. Because in practice it is often difficult, if not
impossible, to distinguish substantively between different types of item score patterns
and/or to obtain additional information using background variables, a more fruitful strategy
may be to test against specific alternatives.

Almost all statistics are of the form given in Equation (11), but the weights are
different. The question then is which statistic should be used. The literature suggests that
the choice of a statistic depends on the model that is used. Using the Rasch model, the theory
presented by Molenaar and Hoijtink and their statistic $M$ are a good choice. Statistic $M$
should be preferred over statistics $l_z$ or ZW because the critical values for $M$ are more
accurate than those of $l_z$ and ZW or those of other statistics. Statistic $M$ is available in
the computer program RSP (Glas & Ellis, 1993) so that the practitioner can easily add the
person-fit values to his/her dataset. With respect to the 2-PLM and 3-PLM, all statistics
proposed suffer from the problem that the standard normal distribution is inaccurate when
$\hat\theta$ is used instead of $\theta$. This seriously reduces the applicability of these statistics. The
theory recently proposed by Snijders (1998) may help the practitioner to obtain the correct
critical values. Another argument for the use of a likelihood-based statistic is that it is
an increasing function of the probability of a score pattern under a model. It can easily
be shown that residual-based statistics like the ones given in Equations (16) and (17) do
not reflect the probability ordering of the score patterns because $1 / \{P_g(\theta)[1 - P_g(\theta)]\}$
is not an increasing function of $P_g(\theta)$.
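
The last point is easy to verify numerically. The weight $1 / \{P_g(\theta)[1 - P_g(\theta)]\}$ is symmetric around $P_g(\theta) = .5$, where it reaches its minimum value of 4; it decreases for probabilities below .5 and increases for probabilities above .5. The short sketch below (the function name is hypothetical) evaluates the weight at a few probabilities.

```python
def residual_weight(p):
    """Weight 1 / [p (1 - p)] attached to a squared residual."""
    return 1.0 / (p * (1.0 - p))


# The weight first decreases and then increases as p runs from .1 to .9,
# so it cannot be an increasing function of p.
weights = {p: residual_weight(p) for p in (0.1, 0.3, 0.5, 0.7, 0.9)}
```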

In a nonparametric context, ZU3 may be preferred over the other fit statistics (like $C^*$)
because this statistic is also an increasing function of the probability of the score pattern
and, moreover, the distribution of ZU3 is known to be standard normal conditional on the
total score. However, it is unknown whether the empirical distribution is in agreement
with the theoretical distribution when nonparametric IRT models are used.

Can Person-Fit Statistics Improve Measurement Practice?

The aim of person-fit measurement is to detect item score patterns that are improbable
given an IRT model or given the other patterns in a sample. The first requirement thus
is that person-fit statistics are sensitive to nonfitting item score patterns. After having
reviewed the studies using simulated data, it can be concluded that detection rates are
highly dependent on (1) the type of aberrant response behavior, (2) the $\theta$ value, and (3)
the test length. When item score patterns do not fit an IRT model, high detection rates can
be obtained in particular for extreme $\theta$s, even when Type I error rates are low (e.g., .001).
The reason is that for extreme $\theta$s deviations from the expected item score patterns tend to be
larger than for moderate $\theta$s. As a result of this pattern misfit, the bias in $\hat\theta$ tends to be
larger for extreme $\theta$s than for moderate $\theta$s (Meijer & Nering, 1997). The general finding
that detection rates for moderate $\theta$s tend to be lower than for extreme $\theta$s thus is not such a
bad result and certainly puts the disappointment some authors (e.g., Reise, 1995) expressed
about low detection rates for moderate $\theta$s in perspective.

Relatively few studies have investigated the usefulness of person-fit statistics for
analyzing empirical data. The few studies that exist have found some evidence that groups
of persons with a priori known characteristics, such as test takers lacking motivation, may
produce deviant item score patterns that are unlikely given the model. However, how useful
person-fit statistics really are again depends on the degree of aberrance of the response
behavior. We agree with some authors (Rudner et al., 1996; Reise & Flannery, 1996)
that new empirical research is needed, but it should be noted that more empirical studies
do not by themselves answer the question of whether person-fit statistics can be helpful in
improving measurement practice. Empirical studies can illustrate the use of a person-fit
statistic. For example, an empirical study may show that examinees who are unmotivated
to fill out a questionnaire can be detected using a particular person-fit statistic. Whether
person-fit statistics can help the researcher in practice depends on the context in which
research takes place.

Smith (1985) mentioned four actions that could be taken when an item score pattern
is classified as aberrant: (1) instead of reporting one ability estimate for an examinee,
report several ability estimates on the basis of subtests that are in agreement
with the model; (2) modify the item score pattern (for example, eliminate the unreached
items at the end) and re-estimate ability; (3) do not report the ability estimate and retest
the person; or (4) decide that the error is small enough for the impact on the ability
estimate to be marginal. This decision can be based on comparing the error introduced by
the measurement disturbance with the standard error associated with each ability estimate.
Which of these actions is taken very much depends on the context in which testing takes
place. The usefulness of person-fit statistics thus also depends heavily on the application
for which they are intended.

Suggestions for Future Research

When reviewing the person-fit literature some suggestions for future research come to 
mind: 

1. With respect to the distributional characteristics of the person-fit statistics, for the
1-PLM sound statistical theory helps the researcher to decide when an item score pattern
is improbable. For the 2-PLM and the 3-PLM, problems exist because $\hat\theta$ instead of
$\theta$ is used, and this makes the estimation of the probability of exceedance unreliable.
Research is needed on methods that correct for the use of $\hat\theta$. Snijders (1998) proposed
a correction of the variance of a group of person-fit statistics. However, the
skewness and kurtosis also should be taken into account, especially when the nominal
Type I error levels are small. In that case, nominal and empirical Type I
error levels are not in agreement (e.g., van Krimpen-Stoop & Meijer, in press-a).

2. With the increasing use of computerized adaptive testing (CAT), the use of
person-fit statistics in this context also may be investigated. McLeod and Lewis (1998)
showed that it indeed makes a difference when a person has preknowledge of some of
the items in a CAT and, as a result, can obtain a higher total score. Therefore,
detection of such persons is important. Nering (1997) and van Krimpen-Stoop and
Meijer (in press-a) showed that in a CAT the distributional characteristics of existing
person-fit statistics are far off the expected distributions. Moreover, the
characteristics of a CAT are unfavorable for person-fit research: there are relatively few items
compared to a paper-and-pencil test, and this results in a lower detection rate.
Furthermore, for each examinee the spread in the item difficulties is small by definition.
McLeod and Lewis (1998) discussed a Bayesian approach for the detection of
examinees with preknowledge of the items. More research, however, is needed. A
possibility is to work with statistics that are especially designed for a CAT. For example,
in a CAT an alternation of correct and incorrect responses is expected.
A number of consecutive correct or incorrect answers is unexpected and may be the
result of aberrant response behavior. Person-fit statistics that are especially designed
for a CAT may be more powerful than "conventional" person-fit statistics, and the
statistical properties of the former statistics should be less susceptible to the
characteristics of a CAT (van Krimpen-Stoop & Meijer, in press-b).

3. Few studies analyze empirical data using person-fit statistics. An explanation may be 
that person-fit statistics only inform the researcher that a score pattern does not fit the 
model without giving extra information about the type of aberrance. There is also the 
difficulty of distinguishing between examinees with item score patterns for whom the 
wrong IRT model is used and those whose item score patterns can be explained using 
additional information. Studies are needed that analyze empirical data together with 
background variables to obtain extra information about the type of aberrance. Reise 
and Flannery (1996) mentioned the application of person-fit research in cross-cultural 
studies to investigate the scalability of examinees with a different ethnic background. 

4. In the context of nonparametric IRT modeling, few statistics exist that test a response 
pattern against model assumptions using known statistical properties. Much more 
research is needed here to obtain sound statistical methods. 

5. One of the biggest problems of using person-fit statistics in practice is the relatively
low power of these statistics in detecting aberrant item score patterns. Testing against
a specified alternative may be a solution. More information is needed, however, about
the influence of aberrant response behavior on the total score on a test.

6. To enhance the interpretation of nonfitting item score patterns it may be possible to 
determine the fit of an item score pattern using the item difficulties determined in 
a well-defined group of examinees or on the basis of a cognitive theory. Aberrant 
response patterns may be more easily interpreted using such external frames of ref- 
erence. 

7. The person response function also may be used to enhance the interpretation of aber- 
rant item score patterns. Because a plot of the observed and expected response func- 
tions immediately clarifies which groups of observed responses disagree with the 
expected responses, the researcher may more easily hypothesize the explanation of 
the aberrant item score patterns. More research is needed here. 
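
Suggestion 2 notes that a long series of consecutive correct or consecutive incorrect answers is unexpected in a CAT. As a purely descriptive sketch of this idea, the length of the longest run of identical item scores can be computed as follows. This is not one of the statistics proposed in the cited work, no null distribution is implied, and the function name is hypothetical.

```python
def longest_run(scores):
    """Length of the longest run of identical consecutive item scores,
    with the scores given in order of administration."""
    longest = 0
    current = 0
    previous = None
    for x in scores:
        current = current + 1 if x == previous else 1
        longest = max(longest, current)
        previous = x
    return longest
```

How long a run has to be before it should be regarded as suspicious depends on the test length and on the item selection algorithm, and would have to be established separately, for example by simulation.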